<<

The Standard Version 3.0

The

ADDISON–WESLEY

An Imprint of Addison Wesley Longman, Inc. , Massachusetts · Harlow, England · Park, California Berkeley, California · Don Mills, Ontario · Sydney Bonn · Amsterdam · Tokyo · Mexico City Many of the designations used by manufacturers and sellers distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If these files have been purchased on computer-readable media, the sole remedy for any claim will exchange of defective media within ninety days of receipt. Dai Kan- Jiten used as the source of reference codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval , or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. Printed in the United States of America. Published simultaneously in Canada. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The camera-ready master of the text and code charts was printed on a Hewlett-Packard LaserJet 4000T. The Han radical- was printed by Apple Computer, Inc. The following companies and organizations supplied :

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype , Inc. Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. Cover motifs: (front), (back). CD-ROM: Phaistos Disk. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A- on the Web: www.awl.com/cseng/ 1 2 3 4 5 6 7 8 9—CRW—0403020100 First printing, January 2000. This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Preface

This book, The Unicode Standard, Version 3.0, is the authoritative source of information on the Unicode standard, the international character code for information processing that includes all major scripts of the world and is the foundation for develop- ment of software for worldwide use. As well as encoding characters used for written com- munication in a simple and consistent manner, the Unicode Standard defines character properties and algorithms for use in implementations. Version 3.0 expands on material from Versions 2.0 and 2.1 and supersedes all other previ- ous versions. The previous versions of the Unicode Standard are: • The Unicode Standard, Version 1.0, Volume 1 (1991) • The Unicode Standard, Version 1.0, Volume 2 (1992) • The Unicode Standard, Version 1.1, Unicode Technical Report #4 (1993) • The Unicode Standard, Version 2.0 (1996) • The Unicode Standard, Version 2.1, Unicode Technical Report #8 (1998) Major additions to Version 3.0 include: • conformance rules for transformation formats • new scripts including Ethiopic, Khmer, Mongolian, , and Sinhala • restructured and enhanced character block descriptions • clarified bidirectional algorithm • updated implementation guidelines •a Shift-JIS index The Unicode Standard maintains consistency with the international standard ISO/IEC 10646. Version 3.0 of the Unicode Standard corresponds to ISO/IEC 10646-1:2000.

0.1 About the Unicode Standard This book defines Version 3.0 of the Unicode Standard. The general principles and archi- tecture of the Unicode Standard, requirements for conformance, and guidelines for imple- menters precede the actual coding information. Useful ancillary information is given in the appendices. The accompanying CD-ROM contains tables of use to implementers and all technical reports published to date.

Concepts, Architecture, Conformance, and Guidelines The first five chapters of Version 3.0 introduce the Unicode Standard and provide the information an engineer needs to produce a conforming implementation. Basic text pro- cessing, working with combining marks, encoding forms, and doing

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. xxv 0.1 About the Unicode Standard Preface layout are all described. A special chapter on implementation guidelines answers many common questions that arise when implementing Unicode. Chapter 1 introduces the standard’ basic concepts, design basis, and coverage, and discusses basic text handling requirements. Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific topics such as text processes, overall char- acter properties, and the use of combining marks. Chapter 3 constitutes the formal statement of conformance. This chap- ter also presents the normative algorithms for three processes: the canonical ordering of combining marks, the encoding of Korean by conjoining jamo, and the formatting of bidirectional text. Chapter 4 describes character properties in detail, both normative (required) and informative. Tables giving additional character property information appear on the CD-ROM. Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown and unsupported characters, and transcoding to other standards.

Character Block Descriptions Chapters 6 through 13 contain the character block descriptions that give basic information about each or collection and may discuss specific characters or pertinent layout information. Chapter 6 describes the general characters. Chapter 7 presents the European Alphabetic scripts, including , Greek, Cyrillic, Armenian, Georgian, Runic, , and associated combining marks. Chapter 8 presents the Middle Eastern, right-to-left scripts: Hebrew, , Syriac, and . Chapter 9 covers the South and Southeast Asian scripts, including , , , , , , , , , Sinhala, Tibetan, , Lao, Khmer, and Myan- mar. Chapter 10 presents the East Asian scripts, including Han, , , Hangul, , and Yi. Chapter 11 presents other scripts, including Ethiopic, Cherokee, Cana- dian Aboriginal Syllabics, and Mongolian. Chapter 12 presents symbols, including currency, letterlike and technical symbols, and mathematical operators. Chapter 13 describes special characters such as the Private Use Area, sur- rogates, and .

Charts and Index The next two chapters document the Unicode Standard’s character code assignments, their names and important descriptive information, and Han indices that aid in locating specific ideographs encoded in Unicode. xxvi Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Preface 0.1 About the Unicode Standard

Chapter 14 gives the code charts and the Character Names List. The code charts contain the normative character encoding assignments, and the names list contains normative information as well as useful refer- ences and informational notes. Chapter 15 provides a radical-stroke index to East Asian ideographs, as well as a Shift-JIS index.

Appendices and Tables The appendices contain detailed background information on important topics: character encoding systems, submission of proposals, and the history of Unicode and its relationship to ISO/IEC 10646. Appendix A describes the history of in the Unicode Standard. Appendix gives instructions on how to submit characters for consider- ation as additions to the Unicode Standard. Appendix details the relationship between the Unicode Standard and ISO/IEC 10646. Appendix lists the changes to the Unicode Standard since Version 2.0. The appendices are followed by a glossary of terms, a bibliography, and two indices: an index to Unicode characters and an index to the text of Chapters 1 through 15.

The Unicode Character Database and Technical Reports The Unicode Character Database is the name for a collection of files that contain character code values, character names, and character property data. It is described more fully in the file UnicodeCharacterDatabase.. Version 3.0.0 of the database is provided on the accompanying CD-ROM. Updates and revisions will be made available online. For infor- mation on the latest available version see: http://www.unicode.org/unicode/standard/versions/ The following Unicode Technical Reports are formally part of this standard: • UTR #11: East Asian Width, Version 5.0 • UTR #13: Unicode Guidelines, Version 5.0 • UTR #14: Line Breaking Properties, Version 6.0 • UTR #15: Unicode Normalization Forms, Version 18.0 The latest available version of these reports is provided on the CD-ROM. Updates and revi- sions will be made available online. For information on the latest available version, see http://www.unicode.org/unicode/standard/versions/.

On the CD-ROM The CD-ROM contains the Unicode Character Database, which gives character codes, char- acter names, character properties, and decompositions for decomposable or compatibility characters. In addition to the Unicode Character Database and Unicode Technical Reports that are part of this standard, the CD-ROM also contains additional technical reports (cov- ering topics such as compression, collation, and transformation formats), as well as property-based mapping tables (for example, tables for case) and transcoding tables for

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. xxvii 0.2 Notational Conventions Preface international, national, and industry character sets (including the Han cross-reference table). For the complete contents of the CD-ROM, see its READ file. Please consult the Unicode Consortium’s online resources (see Section 0.3, Resources) to obtain the most up- to-date versions of the materials on the CD-ROM.

0.2 Notational Conventions Throughout this book, certain typographic conventions are used. In running text, an indi- vidual Unicode value is expressed as +nnnn, where nnnn is a four-digit number in hexa- notation, using the digits 0–9 and the letters A– (for 10 through 15, respectively). • U+0416 is the Unicode value for the character named   -  . In tables, the U+ may be omitted for brevity. A range of Unicode values is expressed as U+xxxx→U+yyyy, or U+xxxx—U+yyyy, or xxxx..yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the , long , or two dots indicate a contiguous range inclusive of the endpoints. • The range U+0900→U+097F contains 128 character values. All Unicode characters have unique names, which are identical to those of the English- edition of International Standard ISO/IEC 10646. Unicode character names contain only uppercase Latin letters A through , digits, , and -minus; this convention makes it easy to generate computer-language identifiers automatically from the names. Unified East Asian ideographs are named   -, where X is replaced with the Unicode value—for example,   - 4E00. The names of are generated algorithmically; for details, see Hangul Names in Section 3.11, Conjoining Jamo Behavior. In running text, a formal Unicode name is shown in small capitals (for example,    ), and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text that is not explicitly encoded (for example, pasekh alef) or to set off a foreign word (for example, the Welsh word ynghyd). Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/. The symbols used in the character names list are described at the beginning of Chapter 14, Code Charts. In the text of this book, the word “Unicode” when used alone as a noun refers to the - code Standard. In this book, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.

Extended BNF The Unicode Standard and technical reports use an extended BNF format for describing syntax. As different conventions are used for BNF, Table 0-1, Extended BNF, lists the nota- tion used here. A sequence of characters is sometimes listed in text with , such as or . Character Classes. A character class is constructed from one or two base sets. It is either a single base set, the of a base set, or the (set) difference between two base sets. The

xxviii Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Preface 0.2 Notational Conventions

Table 0-1. Extended BNF

Symbols Meaning x := ... production rule x the sequence consisting of x then y x* zero or more occurrences of x x? zero or one occurrence of x x+ one or more occurrences of x x | y either x or y ( x ) for grouping x || y equivalent to (x | y | (x y)) { x } equivalent to (x)? "abc" string literals ( “_” is sometimes used to denote space for clarity) ’abc’ string literals (alternative form) \u1234 Unicode characters within string literals or character classes \v00101234 Unicode scalar values within string literals or character classes U+HHHH Unicode character literal: equivalent to ‘\uHHHH’ U-HHHHHHHH Unicode character literal: equivalent to ‘\vHHHHHHHH’ charClass character class (syntax below) base sets themselves are bounded by brackets, and contain lists of characters, ranges of characters, general categories, or of general categories. The syntax follows: charClass := baseSet | '¬' baseSet | baseSet '-' baseSet baseSet := '[' item (','? item)* ']' item := char | char '-' char | '{' '¬'? category '}' General categories are defined in Chapter 4, Character Properties, such as [{Uppercase Let- ter}] for uppercase . Main categories such as [{Mark}] are the equivalent of a list of multiple subcategories: [{Non-Spacing Mark}{Spacing Combining Mark}{Enclosing Mark}]. Examples are found in Table 0-2, Character Class Examples. Table 0-2. Character Class Examples

Syntax Matches [a-z] English lowercase letters [a-z]-[c] English lowercase letters except for c ¬[c] all characters but c [0-9] European decimal digits [\u0030-\u0039] (same as above, using Unicode escapes) [0-9,A-F,a-f] hexadecimal digits [{Letter},{Non-Spacing Mark}] all letters and non-spacing marks [{},{Mn}] (same as above, using abbreviated notation) [{¬Cn}] all assigned Unicode characters [\u0600-\u06FF]-[{Cn}] all assigned Arabic characters

Operators Operators used in this standard are listed in Table 0-3, Operators.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. xxix 0.3 Resources Preface

Table 0-3. Operators ÷ allow break here (see Section 5.15, Locating Text Element Boundaries) × do not allow a break here ò is transformed to, or behaves like / integer division (rounded down) % modulo operation; equivalent to the integer remainder for positive numbers.

0.3 Resources The Unicode Consortium provides a number of online resources for obtaining informa- tion and data about the Unicode Standard, as well as updates and corrigenda. They are listed below.

Unicode Web Site • http://www.unicode.org

Unicode Anonymous FTP Site • ftp://ftp.unicode.org

Unicode Public Mailing List • [email protected] Subscription instructions for the public mailing list are posted on the Unicode Web site.

How to Contact the Unicode Consortium Contact the Consortium for membership information and to order publications (includ- ing additional copies of this book). • Electronic mail : [email protected] • Postal address: P.O. Box 391476 Mountain View, 94039-1476 USA Please check the Web site for up-to-date contact information, including telephone, fax, and delivery address.

xxx Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Acknowledgments

The production of The Unicode Standard, Version 3.0, is due to the dedication of many individuals for over a decade. would like to acknowledge those who were involved in Version 3.0, and particularly the following individuals, whose major contributions were central to the design, authorship, and review of this book. Ken Whistler has been the driving force behind Version 3.0. As managing editor, had responsibility for all aspects of production; in particular, he designed the new chapter structure. Ken meticulously updated the Unicode Character Database, including adding all the new characters and their properties. He also maintained the Character Names List and supplied many of the annotations. Ken co-authored three Unicode Technical Reports, and contributed to the descriptions for many of the new scripts, such as Myanmar and Khmer. He was responsible for the vast majority of the original character block descriptions and made major contributions to the encoding of Indic scripts, IPA, and symbols. Julie D. Allen coordinated the editorial review meetings. As editor, she kept track of edits and text from multiple sources and in various formats, and formatted the text in FrameMaker. She worked with the authors to improve the wording and clarity of the text, and also helped resolve problems resulting from platform conversion. Joan Aliprand chaired the Unicode Technical Committee while Version 3.0 was under development. She is responsible for the front matter, Glossary, References, and General Index. She also contributed to Middle Eastern Scripts. Joe Becker is acknowledged for his initial work and promotion of a 16-bit full encoding and his overall contributions to the creation of the Unicode Standard and editing of this book. has been essential to the development of Version 3.0. Mark has led the way in many aspects of overall design and implementation of the Unicode Standard. He has been the chief architect for the definition of conformance and the bidirectional algorithm. He contributed significant revisions and enhancements to Implementation Guidelines, Bidirec- tional Behavior, Syriac, and the Unicode Character Database. Five of the Unicode Technical Reports are Mark’s work, and he co-wrote two others. His previous contributions include algorithms for canonical equivalence, conjoining Hangul, shaping for specific scripts, and the development of the Unicode Character Database. collected or created many of the fonts used in this standard, and provided typographical review and assistance. He also reviewed the code charts and Character Names List, and provided many annotations. Block descriptions for several of the new scripts are based on drafts prepared by Michael. Asmus Freytag was responsible for the chapters Punctuation, Symbols, and Special Areas and Format Characters. He made significant contributions to Syriac, Bidirectional Behavior, and to the chapters General Structure, European Alphabetic Scripts, and Implementation Guidelines. Asmus created custom formatting software, negotiated with the font houses for donated fonts, and oversaw production of the code charts. He also created the Shift-JIS Index, and he wrote two Unicode Technical Reports and co-authored another. Previously, Asmus contributed substantial text on implementation aspects of the Unicode Standard.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. iii Acknowledgments

John . Jenkins was responsible for East Asian Scripts and contributed substantial new material for that chapter. He maintained and extended the Unihan database, and created the Han Radical-Stroke Index to the ideographic content of the standard. John converted all of the Version 2.0 illustrations to Adobe Illustrator format. He created the fonts for the KangXi and CJK radicals. In addition, he collected fonts from many sources and fine-tuned them for use. John was responsible for the original Han cross-reference tables. Mike Ksar led the effort to synchronize Version 3.0 and the second edition of ISO/IEC 10646-1. He contributed to Middle Eastern Scripts, as well as to Appendix C and Appendix D. Mike also thoroughly reviewed the code charts and the text. Rick McGowan coordinated work on new scripts, and was involved with many of the sub- stantial additions published in Version 3.0. He organized a group of Tibetan experts who developed the extensions for . Rick contributed to the sections on Ogham, Thaana, Tibetan, and Canadian Aboriginal Syllabics. He was responsible for mastering and producing the CD-ROM. Rick contributed to development of the original algorithm for canonical equivalence. Lisa Moore rewrote many parts of the text, including the Preface and Introduction, the chapter General Structure, and Appendix D. She wrote the section on Cherokee, and contrib- uted to the explanations on encoding forms and transformation formats. As current Chair of the Unicode Technical Committee, Lisa oversaw the finalization of the content of Ver- sion 3.0. She previously contributed substantially to character properties. Michel Suignard painstakingly recoded and converted fonts for use in this book. He also contributed to several Unicode Technical Reports. Michel previously contributed to the development of the Unicode Character Database and mapping tables. Fonts were essential for the production of this book. In addition to the individuals men- tioned above and the companies and organizations named in the colophon, fonts were contributed by Michael Stone (Armenian), Al Webster (Cherokee), Oliver Corff (Yi), Yan- nis Haralambous (Syriac), and George Kiraz and Paul Nelson (Syriac). John . Fiscella designed the fonts for symbols and many of the alphabetic scripts. Thomas Milo designed the Arabic font. The font for CJK Extension A was designed by Technical Supervisor Zheng Long and Hua Weicang. Cora Chang created the font for patterns. Critical com- ments and work on fonts are due to Heidi Jenkins, and Dr. Virach Sornlertlamvanich (for Thai). Ralph Buchalter transferred the Version 2.0 master files to the Windows environment, and implemented the new chapter structure. Phil Huston provided FrameMaker expertise. Steve Mehallo designed the interchapter separators, incorporating examples of from numerous sources. Kamal Mansour (Monotype) initiated and coordinated this graphic design process. Both the text and the code charts were reviewed critically by experts. Edwin Hart and Bruce Paterson were principal reviewers of this book. The Editorial Committee appreciates the expert feedback on the code charts provided by Keiko Nagano and Tetsuji Orita of IBM , Hugh McGregor Ross, and, for specific scripts, Yannis Haralambous (Greek and Thai), Constantine Stathopoulos (Greek), and Dr. Virach Sornlertlamvanich (Thai). - bara Beeton of the American Mathematical Society reviewed the math symbols, and Donald . Knuth gave additional input. Cora Chang thoroughly checked the data for CJK ideographs. Mark Davis and Laura Werner did extensive work to verify the consistency of character properties in the Unicode Character Database. Fucheng of RLG helped with the bibliography. The development of this book would not have been possible without the support of the office staff of Unicode, Inc., and the hard work of Mike Kernaghan, as operational manager of the office. We thank Magda Danish, Dee Domingo, Steve Greenfield, Cheryl Morton, iv Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Acknowledgments and Julia Oesterle, all of whom assisted with Version 3.0 in numerous ways. Sarasvati minded the mailing lists. The technical content of the Unicode Standard is determined by the Unicode Technical Committee. Contributors to the work of the UTC include representatives of Full and Asso- ciate Members, Specialist and Individual Members, and Unicode Officers, as well as invited experts and liaisons. Version 3.0 would not have been possible without their creative work and critical thinking over the past three years, augmented by the support of all member representatives: Dean Abramson (), Glenn Adams (Spyglass), Joan Aliprand (RLG), Joshua Arai (Justsystem), Joe Becker (Xerox), Barbara Beeton (American Mathe- matical Society), Chris Boyle (Novell), Stefan Buchta (Oracle), Richard C. Buhler (Novell), Don Carroll (Hewlett-Packard), Steve Cohen (Basis Technology), Lee Collins (Apple), Ken Cortese (Booz, Allen), David . Craig (NCR), Frank Cruz (Columbia University), Mark Davis (Unicode President), Martin Dürst (W3C), Peter Edberg (Apple), John M. Fiscella (Production First), Asmus Freytag (Unicode Vice-President), Deborah Goldsmith (Apple), Edwin F. Hart (SHARE), Hideki Hiura (Sun), Paul Hoffman (Internet Mail Consortium), Lloyd Honomichl (Novell), John Jenkins (Apple), Nailong Jiang (Quark), Robert Keeble (Quark), Susan Kline (Hewlett-Packard), Tatsuo Kobayashi (Justsystem), Mike Ksar (Hewlett-Packard), Michael Kung (Oracle, then Microsoft), Wai Man Long (Compaq), Kamal Mansour (Monotype), Benson Margulies (Basis Technology), Gianni Mariani (SGI), Rick McGowan (NeXT, then Apple), Mike McKenna (Sybase), Thomas Milo (DecoType), Matthias Mittelstein (SAP), Lisa Moore (IBM), Nobuyoshi Mori (SAP), Rudolf Munz (SAP), Brendan Murray (Lotus), Nelson Ng (Oracle), Thomas Nielsen (MGI), Cleo O’Brien (SGI), Sandra Martin O’Donnell (Compaq), Stephen P. Oksala (Uni- sys), Shripad Patki (Sun), Nathan Potechin (MGI), Mirko Raner (MATHEMA), Wendy Rannenberg (Compaq), Gary Roberts (NCR), Murray Sargent (Microsoft), Bernhard Schilling (SAP), Karen Smith-Yoshimura (RLG), Neil Soiffer (Wolfram Research), Mark Son-Bell (Sun), Michel Suignard (Microsoft), Tex Texin (Progress Software), . S. Umama- heswaran (IBM), Marc von Holzen (SGI), Ken Whistler (Sybase), Michael Wiedeking (MATHEMA), Richard A. Willis (Reuters), Arnold F. Winkler (Unisys), Bill Wolf (Booz, Allen), Misha Wolf (Reuters), Eugene Wong (Hewlett-Packard), Jianping Yang (Oracle), Michael Yau (Oracle), Douglas-Val Ziegler (Xerox). A significant part of the UTC’s work was devoted to refining the Unicode Bidirectional Algorithm. The task was carried out by the Bidirectional Subcommittee, consisting of Asmus Freytag (ASMUS, Inc.), Mike Ksar (Hewlett-Packard), Mati Allouche, Mark Davis, Doug Felt, Gidali, Lisa Moore (all from IBM), F. Avery Bishop, David Brown, John McConnell, Paul Nelson (all from Microsoft), and Jony Rosenne (QSM Programming). The reference implementations on the CD-ROM were developed by Doug Felt and Asmus Freytag. The Arabic Ad Hoc group reviewed the existing repertoire and decomposi- tions. Members of the Ad Hoc group were as follows: Joe Becker (Xerox), Asmus Freytag (ASMUS, Inc.), Mike Ksar (Hewlett-Packard), Kamal Mansour (Monotype), Thomas Milo (DecoType), Khaled Sherif (IBM), Ken Whistler (Sybase). The Unicode Technical Committee has worked closely with Technical Committee L2 of the National Committee for Information Technology Standards (NCITS). We appreciate the cooperation of Chair Arnold Winkler and Vice-Chairs Joan Aliprand and Lisa Moore. Arnold Winkler established and efficiently maintained the Web-based archive of L2 docu- ments that was crucial to the development of Version 3.0. The growth of the synchronized character repertoires of the Unicode Standard and Inter- national Standard ISO/IEC 10646 reflects a worldwide effort conducted over many years. With Version 3.0, the Unicode Standard encodes all of the major modern scripts of the

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. v Acknowledgments world. We express deep appreciation to the following experts who shared their specialized knowledge to bring about this achievement: • For Runic: the ISORUNES project of the Swedish Central Board of National Antiquities (Helmer Gustavson, leader, Olle Järnefors, James Knirk, Svante Lagman, Ray Page, Marie Stoklund, and Klaus Düwel). • For Ogham: Michael Everson and Marion Gunn. • For Syriac: George Kiraz and Paul Nelson, who provided the draft block intro- duction for Syriac with the assistance of their colleague Sargon Hasso. Also those who advised them: Dr. Sebastian Brock (Oxford University), Dr. . Coak- ley (Harvard University), Dr. Jan Wilson (Brigham Young University), and Dr. Edward Odisho (North Eastern Illinois University). • For Thaana: the Maldivian Students’ Association and Husine Zahid. • For Sinhala: Lloyd Anderson, Lee Collins, Camillus Jayewardena (University of Ruhuna, Matara, Sri Lanka), Vasantha Karunaratne, Hugh McGregor Ross, and Mettavihari (IBRIC—International Buddhist Research & Information Centre, Sri Lanka). • For Tibetan extensions: Lloyd Anderson, Robert Chilton, Michael Everson, Christopher Fynn, Peter Lofting, Sam Sirlin, Valeriy E. Ushakov, and especially Tony Duff, who drafted the block introduction for Tibetan. • For Myanmar: the Myanmar Information Technology Standardization Com- mittee (represented by Dr. Aung Maw, Khin Maung Lwin, Dr. Kyaw Thien, Thaung Tin, and Thien Htut), and also F. Avery Bishop, Lee Collins, Andy Daniels, Zaw Htut, and Hugh McGregor Ross. • For Khmer: Maurice Bauhahn, F. Avery Bishop, Andy Daniels, Anshuman Pan- dey, and Cathy Wissink. • For Ethiopic: the Committee for the Standardization of Ethiopian Script (Abass Alamnehe, Fesseha Atlaw, Yitna Firdyiwek, Yonas Fisseha, Samuel Kinde, Ter- refe Ras-Work, and especially Daniel Yacob). • For Cherokee: Charles Gourd of the Cherokee Nation of Oklahoma. • For Canadian Aboriginal Syllabics: the Canadian Aboriginal Syllabic Encoding Committee, as well as Alain LaBonté, Henry Leperlier, and especially Dirk Ver- meulen. • For Mongolian: Chen Zhuang, Professor Choijingzhab, Oliver Corff, Myatav Erdenchimeg, Ji Yan, Yumbayariin Namsarai, and, in particular, Richard Moore, who drafted the block introduction for Mongolian. • For the ideographs of Vertical Extension A, the Kangxi Radicals, and the CJK Radicals Supplement: all of the members of the IRG. Major contributors included Zhang Zhoucai (rapporteur), Wang Xiaoming, Chen Zhuang, Yonghe, Bao Huiyun, Zheng Long, Liu Jiayu; Takayuki Sato, Satoshi Yamamoto, Tateo Koike, Eiji Matsuoka, Tatsuo Kobayashi; Lee Joonsuk, Lee Jaehoon, Suh Kyung ; Ngo Trung Viet, Nguyen Quang; Lu Qin, Lee Kin Hong; C. C. Hsu, Emily - Hsu, . C. Kao; Michael Kung, Hideki Hiura, and John Jen- kins. • For advice on Hebrew: Dr. Bella Hass Weinberg. • For advice on Indic scripts: James Agenbroad, Jeroen Hellingman, and Antoine Lecault. vi Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Acknowledgments

• For assistance with character properties of Thai: Ranat Thopunya and Suwit Srivilairith (IBM ). • For contributions to the encoding of North American scripts: Brunner. The Unicode Standard, Version 3.0, would not have been possible without those who made important contributions to earlier versions: Glenn Adams for his work on the character/ model, Devanagari shaping behavior, and the definition of conformance, and for the initial glossary; Lee Collins, for his painstaking work in the integration and extension of mappings from many different sources to create the original Unicode Han unification; Edwin Hart for his work on the character/glyph model; Isai Scheinberg, for his pivotal work in guiding the Unicode Standard through the intricacies of the ISO standards devel- opment process and his contributions to the bidirectional algorithm; and Layne Cannon and Stuart Smith for the Order Mark. Past contributors to the development of the character databases and mapping tables include . D. Chang, Tim Greenwood, Lori Hoerth, and Liao Huan-Mei. Other significant contributors to earlier versions were John Bennett, Joe Bosurgi, Andy Daniels, Burwell Davis, Bill English, Masami Hasegawa, Greg Hullander, Mike Kernaghan, Eric Mader, Dave Opstad, Bill Tuthill, J. . Van Stee, and Glenn Wright. The Unicode Consortium continues to maintain mutually beneficial relationships with international standards organizations. We appreciate the efforts and support of the - bers of ISO/IEC JTC1/SC2/WG2 and the members of the Ideographic Rapporteur Group toward the common goal of keeping both standards synchronized. We would particularly like to thank the Convener of WG2, Mike Ksar, the Rapporteur of the IRG, Zhang Zhoucai, and WG2’s Editors and Contributing Editors, especially Bruce Paterson, Michel Suignard, and Michael Everson. We also thank Asmus Freytag for his effective representation of the Unicode Consortium at WG2 meetings. We would like to thank the members of ISO/IEC JTC1/SC22/WG20, especially Alain LaBonté and the Convener, Arnold F. Winkler, for their work with the Consortium on a common collation algorithm and properties related to identifiers. The support of member companies has been crucial to The Unicode Standard, Version 3.0. Particular thanks for facilities, equipment, and resources are owed to Adobe Systems, Inc., Apple Computer, Inc., Hewlett-Packard Company, IBM Corporation, Microsoft Corpora- tion, and Monotype Typography, Inc. While we gratefully acknowledge the contributions of all persons named in this section, any errors or omissions in this work are the responsibility of the Unicode Consortium.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. vii Unicode Consortium Members and Directors

Full Members While Version 3.0 of the Unicode Standard was under development, the following compa- nies were Full Members of the Unicode Consortium: Apple Computer, Inc. NeXT Software, Inc. Basis Technology Corporation Novell, Inc. Booz, Allen & Hamilton, Inc. Oracle Corporation Compaq Computer Corporation The Research Libraries Group, Inc. Gamma Productions, Inc. Reuters, Ltd. Hewlett-Packard Company SAP AG IBM Corporation Silicon Graphics, Inc. Justsystem Corporation Spyglass, Inc. MATHEMA Software GmbH Sun Microsystems, Inc. MGI Software Corporation Sybase, Inc. Microsoft Corporation Unisys Corporation NCR Corporation Xerox Corporation

Current Associate Members Adobe Systems, Inc. Libronix Corporation Arphic Technology Co. Ltd. Research Systems, Inc. ASMUS, Inc. Lycos, Inc. Automated Wagering International, Inc. Monotype Typography, Inc. Beijing Zhong Yi Electronics Co. Netscape Communications Corporation Bitstream, Inc. OCLC, Inc. BMC Software, Inc. OneRealm, Inc. The Church of Jesus Christ of Latter-day Production First Software Saints Progress Software Corporation Columbia University QUALCOMM Incorporated Data Research Associates Quark, Inc. DecoType, Inc. Rogue Wave Software Endeavor Information Systems, Inc. The Royal Library, Sweden Ericsson Mobile Communications RWS Group, LLC eTranslate, Inc. SHARE, Inc. Information Services, Inc. Siebel Systems, Inc. The Getty Information Institute Software AG Global System for Sustainable Development Software Engineering Australia, Ltd. Government of Tamil Nadu, India StarTV—Satellite Television Asia Region Telecom CSL Ltd. Innovative Interfaces, Inc. Symbian, Ltd. Internet Mail Consortium Uniscape, Inc. Language Analysis Systems, Inc. VTLS, Inc.

viii Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Current Liaison Members Association for Font Information Interchange Korea, Kongju National Library, Chung-nam (AFII) Viet Nam, Technical Committee on Information , Center of Computer and Information Technology (TCVN/TC1), Hanoi Development (CCID), Beijing Consortium (W3C) I18N The Internet Engineering Task Force (IETF) Working Group

Current Specialist Members James Agenbroad, Rolf Bourges, G. David Butler II, James Caldwell, Peter Constable, Tony Graham, Stephen Holmes, Kent Karlsson

Current Individual Members Lloyd Anderson, Stephanos Antoniades, Dr. Yves Arrouye, Stella Ayzinger, Hasan Bateson, Charles Bishop, Philip Blair, Brad Bodkin, François-Paul Briand, Brian M. Carr, Shang- Cheng, Marco Cimarosti, John Dlugosz, George Foot, Richard A. Gard, Dr. Harry Gaylord, Dan Glisson, Jennifer Goodman, Tim Greenwood, Phillip H. Griffin, Bob Hallisy, James Harlow, Nyr Indictor, Jens K. Jensen, Bohdan Kantor, James Kass, Tim Kimball, W. J. Kurmey, Rob Latousek, Kevin Leake, Jonathan Lewis, Giovanni Lussu, Pankaj Madia, Dr. Karnati Mallesham, Greg Melahn, Dr. Tag Young Moon, Robert J. Morris, Victor Nguyen, Paul Pavlik, Phan Phan, Gabriel Plumlee, Christopher Pullen, Patrick P. Reilly, Daniel Rybowski, Tetsuo Seto, CathyAnn Swindlehurst, Lang-Fen Hu Tang, Mituhiko Toho, Tom Twiddy, L. Segond von Banchet, Joan Winters

Current Members of the Board of Directors Wayne . Boyle (NCR Corporation) Brian E. Carpenter (IBM Corporation & Internet Architecture Board) Kevin Cavanaugh (Lotus Development Corporation) Tatsuo L. Kobayashi (Scholex Company, Ltd.) Mike Ksar (Hewlett-Packard Company) Elizabeth G. Nichols (IBM Corporation) Wendy Rannenberg (Compaq Computer Corporation) Franz G. Rau (Microsoft Corporation) David R. Richards (The Research Libraries Group, Inc.)

Former Members of the Board of Directors Jerry Barber (Aldus Corporation) Paul Maritz (Microsoft Corporation) Janet Buschert (Hewlett-Packard Company) Stephen P. Oksala (Unisys Corporation) Robert M. Carr (Go Corporation) Mike Potel (Taligent, Inc.) John Gage (Sun Microsystems, Inc.) Rick Spitz (Apple Computer, Inc.) Paul Hegarty (NeXT Software, Inc.) Lawrence Tesler (Apple Computer, Inc.) Gary C. Hendrix (Symantec Corporation) Guy “Bud” Tribble (NeXT Software, Inc.) Richard J. Holleman (IBM Corporation) Kazuya Watanabe (Novell KK) Charles Irby (Metaphor, Inc.) Gayn B. Winters (Digital Equipment Corpora- Jay E. Israel (Novell, Inc.) tion) Ilene H. Lang (Digital Equipment Corporation)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. ix x Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 1 Introduction

The Unicode Standard is the universal character encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. As the default encoding of HTML and XML, the Unicode Standard provides a sound underpinning for the World Wide Web and new methods of business in a networked world. Required in new Internet protocols and implemented in all modern operating systems and computer such as Java, Unicode is the basis of software that must function all around the world.

With Unicode, the information technology industry gains data stability instead of proliferating character sets; greater global interoperability and data interchange; and simplified software and reduced development costs.

While modeled on the ASCII character set, the Unicode Standard goes far beyond ASCII's limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world--more than 1 million characters can be encoded. No or control code is required to specify any character in any language. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility (see Figure 1-1, Wide ASCII ).

Figure 1-1. Wide ASCII

The Unicode Standard specifies a numeric value and a name for each of its characters. In this respect, it is similar to other character encoding standards from ASCII onward. In addition to character codes and names, other information is crucial to ensure legible text: a character's case, directionality, and alphabetic properties must be well defined. The Unicode Standard defines this and other semantic information, and includes application data such as case mapping tables and mappings to the repertoires of international, national, and industry character sets. The Unicode Consortium provides this additional information to ensure consistency in the implementation and interchange of Unicode data.

Unicode provides for two encoding forms: a default 16-bit form and a byte-oriented form called UTF-8 that has been designed for ease of use with existing ASCII-based systems. The Unicode Standard, Version 3.0, is code-for-code identical with International Standard ISO/IEC 10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/IEC 10646.

Using a 16-bit encoding means that code values are available for more than 65,000 characters. While this number is sufficient for coding the characters used in the major languages of the world, the Unicode Standard and ISO/IEC 10646 provide the UTF-16 extension mechanism (called surrogates in the Unicode Standard), which allows for the encoding of as many as 1 million additional characters without any use of escape codes. This capacity is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world.

1.1 Coverage

The Unicode Standard, Version 3.0, contains 49,194 characters from the world's scripts. These characters are more than sufficient not only for modern communication, but also for the classical forms of many languages. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia. The unified Han contains 27,484 ideographic characters defined by national and industry standards of China, Japan, Korea, , , and . In addition, the Unicode Standard includes punctuation marks, mathematical symbols, technical symbols, , and .

Many new scripts and characters have been added in Version 3.0, including Ethiopic, Canadian Aboriginal Syllabics, Cherokee, Sinhala, Syriac, Myanmar, Khmer, Mongolian, Braille, and additional ideographs. Overall character allocation and code ranges are detailed in Chapter 2, General Structure .

Note, however, that the Unicode Standard does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use characters, nor does it encode logos or graphics. Graphologies unrelated to text, such as dance notations, are likewise outside the scope of the Unicode Standard. Font variants are explicitly not encoded. The Unicode Standard reserves 6,400 code values in the basic 16-bit encoding for the Private Use Area, which may be used to assign codes to characters not included in the repertoire of the Unicode Standard.

There are 7,827 unused code values, which will permit future expansion in the basic encoding space. Provision has also been made for another 917,476 code points through the use of surrogate pairs. Surrogate pairs make another 131,068 private-use code points available should 6,400 prove insufficient for particular applications.

Standards Coverage

The Unicode Standard is a superset of all characters in widespread use today. It contains the characters from major international and national standards as well as prominent industry character sets. For example, Unicode incorporates the ISO/IEC 6937 and ISO/IEC 8859 families of standards, the SGML standard ISO/IEC 8879, and bibliographic standards such as ISO 5426. Important national standards contained within Unicode include ANSI Z39.64, C 5601, JIS X 0209, JIS X 0212, GB 2312, and CNS 11643. Industry code pages and character sets from Adobe, Apple, , Hewlett-Packard, IBM, Lotus, Microsoft, NEC, and Xerox are fully represented as well.

For a complete list of ISO and national standards used as sources, see the References section of The Unicode Standard, Version 3.0.

New Characters

The Unicode Standard continues to respond to new and changing industry demands by encoding important new characters. As an example, when the need to support the sign arose, The Unicode Standard, Version 2.1, with euro support was issued to ensure a conforming version.

As the universal character encoding scheme, the Unicode Standard must also respond to scholarly needs. To preserve world cultural heritage, important archaic scripts are encoded as proposals are developed.

For instructions on how to submit proposals for new characters to the Unicode Consortium, see Submitting New Characters.

1.2 Design Basis

The primary goal of the development effort for the Unicode Standard was to remedy two serious problems common to most multilingual computer programs. The first problem was the overloading of the font mechanism when encoding characters. Fonts have often been indiscriminately mapped to the same set of . For example, the bytes 0x00 to 0xFF are often used for both characters and dingbats. The second major problem was the use of multiple, inconsistent character codes because of conflicting national and industry character standards. In Western European software environments, for example, one often finds confusion between the Windows Latin 1 1252 and ISO/IEC 8859-1. In software for East Asian ideographs, the same set of bytes used for ASCII may also be used as the second byte of a double-byte character. In these situations, software must be able to distinguish between ASCII and double-byte characters.

The ASCII 7-bit code space and its 8-bit extensions, although used in most computing systems, are limited to 128 and 256 code positions, respectively. These 7- and 8-bit code spaces are inefficient and completely inadequate in the global computing environment.

When the Unicode project began in 1988, the groups most affected by the lack of a consistent international character standard included publishers of scientific and mathematical software, newspaper and book publishers, bibliographic information services, and academic researchers. More recently, the computer industry has adopted an increasingly global outlook, building international software that can be easily adapted to meet the needs of particular locations and cultures. The explosive growth of the Internet has merely added to the demand for a character set standard that can be used all over the world.

The designers of the Unicode Standard envisioned a uniform method of character identification that would be more efficient and flexible than previous encoding systems. The new system would satisfy the needs of technical and multilingual computing and would encode a broad range of characters for professional-quality typesetting and desktop publishing worldwide.

The Unicode Standard was designed to be:

• Universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange, including those in major international, national, and industry character sets. • Efficient. is simple to parse: software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. • Uniform. A character code allows for efficient sorting, searching, display, and editing of text. • Unambiguous. Any given 16-bit value always represents the same character.

Figure 1-2, Universal, Efficient, and Unambiguous demonstrates some of these features, contrasting the Unicode encoding with mixtures of single-byte character sets with escape sequences to shift the meanings of bytes.

Figure 1-2. Universal, Efficient, and Unambiguous

1.3 Text Handling

Computer text handling involves processing and encoding. When a word processor user types in text via a keyboard, the computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054. The word processor stores the number in memory and also passes it on to the display software responsible for putting the character on the screen. This display software, which may be a windows manager or part of the word processor itself, then uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.

The Unicode Standard directly only the encoding and semantics of text and not any other actions performed on the text. In the preceding scenario, the word processor might check the typist's input after it has been encoded to look for misspelled words, and then highlight any errors it finds. Alternatively, the word processor might insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that the standard does not specify how to carry out these processes as long as the character encoding and decoding is performed properly and the character semantics are maintained.

Interpreting Characters

The difference between identifying a code value and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code value is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper, called a glyph, is a visual representation of the character.

The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how are rendered. Ultimately, the software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or orientation of on-screen characters.

Text Elements

The successful encoding, processing, and interpretation of text requires appropriate definition of useful elements of text and the basic rules for interpreting text. The definition of text elements often changes depending on the process handling the text. For example, when searching for a particular word or character written with the , one often wishes to ignore differences of case. However, correct spelling within a document requires case sensitivity.

The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements of encoding, called code values. A code value, commonly called a character, is fundamental and useful for computer text processing. For the most part, code values correspond to the most commonly used text elements.

1.4 The Unicode Standard and ISO/IEC 10646

The Unicode Standard is fully compatible with the international standard ISO/IEC 10646-1:2000, Information Technology--Universal Multiple-Octet Coded Character Set (UCS)--Part 1: Architecture and Basic Multilingual , which is also known as the Universal Character Set (UCS). During 1991, the Unicode Consortium and the International Organization for Standardization (ISO) recognized that a single, universal character code was highly desirable. A formal convergence of the two standards was negotiated, and their repertoires were merged into a single character encoding in January 1992. Since then, close cooperation and formal liaison between the committees have ensured that all additions to either standard are coordinated and kept synchronized, that the two standards maintain exactly the same character repertoire and encoding.

Version 3.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646-1:2000. This code-for-code identity holds true for all encoded characters in the two standards, including the East Asian (Han) ideographic characters. ISO/IEC 10646 provides character names and code values; the Unicode Standard provides the same names and code values plus important implementation algorithms, properties, and other useful semantic information.

For details about the Unicode Standard and ISO/IEC 10646, see Appendix C, Relationship to ISO/IEC 10646 , and Appendix D, Changes from Unicode Version 2.0 in The Unicode Standard, 3.0.

1.5 The Unicode Consortium

The Unicode Consortium was incorporated in January 1991, under the name Unicode, Inc., to promote the Unicode Standard as an international encoding system for information interchange, to aid in its implementation, and to maintain quality control over future revisions.

To further these goals, the Unicode Consortium cooperates with the International Organization for Standardization (ISO/IEC/JTC1). It holds a Class C liaison membership with ISO/IEC JTC1/SC2; it participates in the work of both JTC1/SC2/WG2 (the technical working group for the subcommittee within JTC1 responsible for character set encoding) and the Ideographic Rapporteur Group (IRG) of WG2. The Consortium is a member company of the National Committee for Information Technology Standards, Technical Committee L2 (NCITS/L2), an accredited U.S. standards organization. In addition, Full Member companies of the Unicode Consortium have representatives in many countries who also work with other national standards bodies.

A number of organizations are Liaison Members of the Unicode Consortium: the Center for Computer & Information Development (CCID, China), the Internet Engineering Task Force (IETF), the Kongju National Library (Chung-nam, Korea), the Technical Committee on Information Technology (TCVN/TC1, Viet Nam), and the World Wide Web Consortium (W3C) I18N Working Group.

Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world who support the Unicode Standard and who would like to assist in its extension and widespread implementation. Full and Associate Members represent a broad spectrum of corporations and organizations in the computer and information processing industry. The Consortium is supported financially solely through membership dues.

The Unicode Technical Committee The Unicode Technical Committee (UTC) is the working group within the Consortium responsible for the creation, maintenance, and quality of the Unicode Standard. The UTC controls all technical input to the standard and makes associated content decisions. Full Members of the Consortium vote on UTC decisions. Associate and Specialist Members and Officers of the Unicode Consortium are nonvoting UTC participants. Other attendees may participate in UTC discussions at the discretion of the Chair, as the intent of the UTC is to act as an open forum for the free exchange of technical ideas.

Chapter 2 General Structure 2

This chapter discusses the fundamental principles governing the design of the Unicode Standard and presents an overview of its main features. It includes discussion of text pro- cesses, unification principles, allocation of codespace, character properties, writing direc- tion, and a description of combining marks and how they are employed in Unicode character encoding. This chapter also gives general requirements for creating a text- processing system that conforms to the Unicode Standard. Formal requirements for conformance appear in Chapter 3, Conformance. Character properties, both normative and informative, are given in Chapter 4, Character Properties. A set of guidelines for implement- ers is provided in Chapter 5, Implementation Guidelines.

2.1 Architectural Context A character code standard such as the Unicode Standard enables the implementation of useful processes operating on textual data. The interesting end products are not the charac- ter codes but the text processes, because these directly serve the needs of a system’s users. Character codes are like nuts and bolts—minor, but essential and ubiquitous components used in many different ways in the construction of computer software systems. No single design of a character set can be optimal for all uses, so the architecture of the Unicode Stan- dard strikes a balance among several competing requirements.

Basic Text Processes Most computer systems provide low-level functionality for a small number of basic text processes from which more sophisticated text-processing capabilities are built. The follow- ing text processes are supported by most computer systems to some degree: • Rendering characters visible (including ligatures, contextual forms, and so on) • Breaking lines while rendering (including hyphenation) • Modifying appearance, such as point size, kerning, underlining, slant, and weight (light, demi, bold, and so on) • Determining units such as “word” and “sentence” • Interacting with users in processes such as selecting and highlighting text • Modifying keyboard input and editing stored text through insertion and dele- tion • Comparing text in operations such as determining the sort order of two strings, or filtering or matching strings • Analyzing text content in operations such as spell-checking, hyphenation, and parsing morphology (that is, determining word roots, stems, and affixes)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 9 2.1 Architectural Context General Structure

• Treating text as bulk data for operations such as compressing and decompress- ing, truncating, transmitting, and receiving

Text Elements, Code Values, and Text Processes One of the more profound challenges in designing a worldwide character encoding stems from the fact that, for each text process, written languages differ in what is considered a fundamental unit of text, or a text element. For example, in traditional German , the letter combination “ck” is a text ele- ment for the process of hyphenation (where it appears as “k-k”), but not for the process of sorting; in Spanish, the combination “” may be a text element for the traditional process of sorting (where it is sorted between “l” and “m”), but not for the process of rendering; and in English, the objects “A” and “a” are usually distinct text elements for the process of rendering, but generally not distinct for the process of searching text. The text elements in a given language depend upon the specific text process; a text element for spell-checking may have different boundaries from a text element for sorting purposes. A character encoding standard provides the fundamental units of encoding (that is, the abstract characters), which must exist in a unique relationship to the assigned numerical code values. These code values are the smallest addressable units of stored text. An important class of text elements is called a , which typically corresponds to what a user thinks of as a “character.” Figure 2-1 illustrates the relationship between abstract characters and . Figure 2-1. Text Elements and Characters Text Elements Characters ‚

‚ C ü

Grapheme: c h (Spanish)

çS · ÷ á ç

Word: cat c a t

The design of the character encoding must provide precisely the set of code values that allows programmers to design applications capable of implementing a variety of text pro- cesses in the desired languages. These code values may not map directly to any particular set of text elements that is used by one of these processes.

10 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.1 Architectural Context

Text Processes and Encoding In the case of English text using an encoding scheme such as ASCII, the relationships between the encoding and the basic text processes built on it are seemingly straightforward: characters are generally rendered visible one by one in distinct rectangles from left to right in linear order. Thus one character code inside the computer corresponds to one logical character in a process such as simple English rendering. When designing an international and multilingual text encoding such as the Unicode Stan- dard, the relationship between the encoding and implementation of basic text processes must be considered explicitly, for several reasons: • Many assumptions about character rendering that hold true for English fail for other writing systems. Unlike in English, characters in other writing systems are not necessarily rendered visible one by one in rectangles from left to right. In many cases, character positioning is quite complex and does not proceed in a linear fashion. See Section 8.2, Arabic, in Chapter 8, Middle Eastern Scripts, and Section 9.1, Devanagari, in Chapter 9, South and Southeast Asian Scripts, for detailed examples of this situation. • It is not always obvious that one set of text characters is an optimal encoding for a given language. For example, two approaches exist for the encoding of accented characters commonly used in French or Swedish: ISO/IEC 8859 defines letters such as “ä” and “ö” as individual characters, whereas ISO 5426 represents them by composition instead. In the Swedish language, both are considered distinct letters of the , following the letter “z”. In French, the on a merely marks it as being pronounced in isolation. In practice, both approaches can be used to implement either language. • No encoding can support all basic text processes equally well. As a result, some trade-offs are necessary. For example, ASCII defines separate codes for upper- case and lowercase letters. This choice causes some text processes, such as ren- dering, to be carried out more easily, but other processes, such as comparison, to become more difficult. A different encoding design for English, such as case- shift control codes, would have the opposite effect. In designing a new encoding scheme for complex scripts, such trade-offs must be evaluated and decisions made explicitly, rather than unconsciously. For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. (See Section 5.17, Sorting and Searching, for implementation guidelines.) Text processes supporting many languages are often more complex than they are for English. The character encoding design of the Unicode Standard strives to minimize this additional complexity, enabling modern computer systems to interchange, render, and manipulate text in a user’s own script and language—and possibly in other languages as well.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 11 2.2 Unicode Design Principles General Structure

2.2 Unicode Design Principles The design of the Unicode Standard reflects the 10 fundamental principles stated in Table 2-1. Not all of these principles can be satisfied simultaneously. The design strikes a balance between maintaining consistency for the sake of simplicity and efficiency and maintaining compatibility for interchange with existing standards. Table 2-1. The 10 Unicode Design Principles

Principle Statement Sixteen-bit character codes Unicode character codes have a width of 16 bits. Efficiency Unicode text is simple to parse and process. Characters, not glyphs The Unicode Standard encodes characters, not glyphs. Semantics Characters have well-defined semantics. Plain text The Unicode Standard encodes plain text. Logical order The default for memory representation is logical order. Unification The Unicode Standard unifies duplicate characters within scripts across languages. Dynamic composition Accented forms can be dynamically composed. Equivalent sequence Static precomposed forms have an equivalent dynamically composed sequence of characters. Convertibility Accurate convertibility is guaranteed between the Unicode Standard and other widely accepted standards.

Sixteen-Bit Character Codes Plain Unicode text consists of sequences of 16-bit Unicode character codes. Sequences of 16-bit codes are easy to parse and allow for efficient sorting, searching, display, and editing of text. From the full range of 65,536 code values, 63,486 are available to represent charac- ters with single 16-bit code values, and 2,048 code values are available to represent an addi- tional 1,048,544 characters through paired 16-bit code values. These paired code values, or surrogates, will allow implementations access to additional characters in the future. None of these surrogate pairs has been assigned in this version of the standard. The unrestricted range of 256 byte values is used for either of the two bytes that compose a Unicode 16-bit code value. An 8-bit-oriented process that expects certain ranges of byte values to be reserved may fail if unexpectedly presented with Unicode’s 16-bit character codes. For compatibility with existing environments, a lossless transformation for convert- ing 16-bit Unicode values into a form appropriate for 8-bit environments has been defined; it is called UTF-8. UTF-8 (Unicode Transformation Format-8) is the standard method for transforming Uni- code values into a sequence of 8-bit codes. It is intended to be used where 8-bit codes are needed and/or when the ASCII range of values need to be preserved—for example, when transmitting data through byte-oriented protocols or when using byte-oriented APIs. For further information on transformation formats of Unicode, see Section 2.3, Encoding Forms, and Appendix C.3, UCS Transformation Formats, in this book. Also see ISO/IEC 10646 Transformation Formats. For a format to be used on EBCDIC-based systems, see Unicode Technical Report #16, “UTF-EBCDIC,” on the CD-ROM or the up-to-date ver- sion on the Unicode Web site.

12 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.2 Unicode Design Principles

Efficiency The Unicode Standard is designed to make efficient implementation possible. There are no escape characters or shift states in the Unicode character encoding scheme. Each character code has the same status as any other character code; all codes are equally accessible. All encoding forms are self-synchronizing. When randomly accessing a string, a program can find the boundary of a character with limited backup. In UTF-16, if a pointer points to a low-surrogate, a single backup is required. In UTF-8, if a pointer points to a byte starting with 10xxxxxx (in binary), one to three backups are required to find the beginning of the character. By convention, characters of a script are grouped together as far as is practical. Not only is this practice convenient for looking up characters in the code charts, but it makes imple- mentations more compact. The common punctuation characters are shared. Formatting characters are given specific and unambiguous function in the Unicode Stan- dard. This design simplifies the support of . To keep implementations simple and efficient, stateful controls and formatting characters are avoided wherever possible.

Characters, Not Glyphs The Unicode Standard draws a distinction between characters, which are the smallest com- ponents of written language that have semantic value, and glyphs, which represent the shapes that characters can have when they are rendered or displayed. Various relationships may exist between character and glyph: a single glyph may correspond to a single character, or to a number of characters, or multiple glyphs may result from a single character. The distinction between characters and glyphs is illustrated in Figure 2-2. Unicode characters represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. Characters are rep- resented by code values that reside only in a memory representation, as strings in memory, or on disk. The Unicode Standard deals only with character codes. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters. A repertoire of glyphs makes up a font. Glyph shape and meth- ods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard. Figure 2-2. Characters Versus Glyphs

Glyphs Unicode Characters A ei_` c U+0041 LATIN CAPITAL LETTER A a fjabh a d U+0061 LATIN SMALL LETTER A fi fi U+0066 LATIN SMALL LETTER F + U+0069 LATIN SMALL LETTER klnm U+0647 ARABIC LETTER HEH

For certain scripts, such as Arabic and the various Indic scripts, the number of glyphs needed to display a given script may be significantly larger than the number of characters encoding the basic units of that script. The number of glyphs may also depend on the orthographic style supported by the font. For example, an Arabic font intended to support the style of Arabic script may possess many thousands of glyphs. However, the

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 13 2.2 Unicode Design Principles General Structure character encoding employs the same few dozen letters regardless of the font style used to depict the character data in context. A font and its associated rendering process define an arbitrary mapping from Unicode val- ues to glyphs. Some of the glyphs in a font may be independent forms for individual char- acters; others may be rendering forms that do not directly correspond to any single character. The process of mapping from characters in the memory representation to glyphs is one aspect of text rendering. The final appearance of rendered text may also depend on context (neighboring characters in the memory representation), variations in typographic design of the fonts used, and formatting information (point size, superscript, subscript, and so on). The results on screen or paper can differ considerably from the prototypical shape of a letter or character, as shown in Figure 2-3. Figure 2-3. Unicode Character Code to Rendered Glyphs Text Character Sequence ① Ò 0000 1001 0010 1010 ② @ 0000 1001 0100 0010 ③ Ú 0000 1001 0011 0000 Font ④ @÷ 0000 1001 0100 1101 (Glyph Source) ⑤ Ì 0000 1001 0010 0100  ç@ 0000 1001 0011 0100

Text Rendering Process

 ③④ Ò① êçÌü ⑤ ②

For all scripts, an archetypical relation exists between character code sequences and result- ing glyphic appearance. For the Latin script, this relationship is simple and well known; for several other scripts, it is documented in this standard. However, in all cases, fine typogra- phy requires a more elaborate set of rules than given here. The Unicode Standard docu- ments the default relationship between character sequences and glyphic appearance solely for the purpose of ensuring that the same text content is always stored with the same, and therefore interchangeable, sequence of character codes.

14 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.2 Unicode Design Principles

Semantics Characters have well-defined semantics. Character property tables are provided for use in parsing, sorting, and other algorithms requiring semantic knowledge about the code points. See Section 5.15, Locating Text Element Boundaries, Section 5.16, Identifiers, and Section 5.17, Sorting and Searching, for suggested implementations. The properties identi- fied by the Unicode Standard include numeric, spacing, combination, and directionality properties (see Chapter 4, Character Properties). Additional properties may be defined as needed from time to time. By itself, neither the character name nor its location in the code table designates its properties. For an exception, see Section 4.1, Case—Normative.

Plain Text Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes. In contrast, fancy text, also known as rich text, is any text representation consisting of plain text plus added information such as a language iden- tifier, font size, color, hypertext links, and so on. For example, the text of this book, a mul- tifont text as formatted by a desktop publishing system, is fancy text. Many kinds of data structures can be built into fancy text. For example, in fancy text con- taining ideographs an application may store the phonetic of ideographs some- where in the fancy text structure. The simplicity of plain text gives it a natural role as a major structural element of fancy text. SGML, HTML, XML, or TEX are examples of fancy text fully represented as plain text streams, interspersing plain text data with sequences of characters that represent the addi- tional data structures. Many popular word processing packages rely on a buffer of plain text to represent the content and to implement links to a parallel store of formatting data. The relative functional roles of both plain and fancy text are well established: • Plain text is the underlying content stream to which formatting can be applied. • Plain text is public, standardized, and universally readable. • Fancy text representation may be implementation-specific or proprietary. Although some fancy text formats have been standardized or made public, the majority of fancy text designs are vehicles for particular implementations and are not necessarily read- able by other implementations. Given that fancy text equals plain text plus added informa- tion, the extra information in fancy text can always be stripped away to reveal the “pure” text underneath. This operation is often employed, for example, in word processing sys- tems that use both their own private fancy format and plain text file format as a universal, if limited, means of exchange. Thus, by default, plain text represents the basic, interchange- able content of text. Standards for markup language, such as XML and HTML, use plain text for the entire file. They use special conventions embedded within the plain text file such as “

” to distin- guish the markup from the “real” content. XML, in particular, uses Unicode as a base-level encoding: “All XML processors must be able to read entities in either UTF-8 or UTF-16.”— W3C Recommendation: Extensible Markup Language (XML) 1.0, §4.3.3. Files not using these character sets in XML must have an encoding declaration to indicate any other char- acter set. Because plain text represents character content, it has no inherent appearance. It requires a rendering process to make it visible. If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance. Instead, the disparate rendering processes are simply required to

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 15 2.2 Unicode Design Principles General Structure make the text legible according to the intended reading. Therefore, the relationship between appearance and content of plain text may be stated as follows: Plain text must contain enough information to permit the text to be rendered legibly, and nothing more. The Unicode Standard encodes plain text. The distinction between data encoded in the Unicode Standard and other forms of data in the same data stream is the function of a higher-level protocol and is not specified by the Unicode Standard itself. The 64 control code positions of ISO/IEC 6429 (commonly used with ISO/IEC 646 and ISO/IEC 8859) are retained for compatibility and may be used to implement such protocols. (See Section 2.8, Controls and Control Sequences.)

Logical Order For all scripts, Unicode text is stored in logical order in the memory representation, roughly corresponding to the order in which text is typed in via the keyboard. In some circum- stances, the order of characters differs from this logical order when the text is displayed or printed. Where needed to ensure consistent legibility, the Unicode Standard defines the conversion of Unicode text from the memory representation to readable (displayed) text. The distinction between logical order and display order for reading is shown in Figure 2-4. Figure 2-4. Bidirectional Ordering

When the text in Figure 2-4 is ordered for display, the glyph that represents the first charac- ter of the English text appears at the left. The logical start character of the Hebrew text, however, is represented by the Hebrew glyph closest to the right margin. The succeeding Hebrew glyphs are laid out to the left. Logical order applies even when characters of different dominant direction are mixed: left- to-right (Greek, Cyrillic, Latin) with right-to-left (Arabic, Hebrew), or with vertical script. Properties of directionality inherent in characters generally determine the correct display order of text. This inherent directionality is occasionally insufficient to render plain text legibly, however. This situation can arise when scripts of different directionality are mixed. For this reason, the Unicode Standard includes characters to specify changes in direction. Chapter 3, Conformance, provides rules for the correct presentation of text containing left- to-right and right-to-left scripts. For the most part, logical order corresponds to phonetic order. The only current exceptions are the Thai and Lao scripts, which employ visual ordering; in these two scripts, users tra- ditionally type in visual order rather than phonetic order. Characters such as the in Devanagari are displayed before the characters that they logically follow in the memory representation. (See Section 9.1, Devanagari, for further explanation.) Combining marks (accent marks in the Greek, Cyrillic, and Latin scripts, vowel marks in Arabic and Devanagari, and so on) do not appear linearly in the final rendered text. In a Unicode character code string, all such characters follow the base character that they

16 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.2 Unicode Design Principles modify (for example, Roman “ã” is stored as “a” followed by combining “þ”ì when not stored in a precomposed form).

Unification The Unicode Standard avoids duplicate encoding of characters by unifying them within scripts across languages; characters that are equivalent in form are given a single code. Common letters, punctuation marks, symbols, and are given one code each, regardless of language, as are common Chinese/Japanese/Korean (CJK) ideographs. (See Section 10.1, Han.) Care has been taken not to make artificial distinctions among characters. Users may become confused when they see an Å on the screen but their search dialog does not find it. The reason that this situation occurs is that what they see on the screen is not an Å (A- )—it is an Å (Ångström). This phenomenon is called visual ambiguity. It is quite normal for many characters to have different usages, such as “,” for either thousands-separator (English) or decimal-separator (French). The Unicode Standard avoids duplication of characters due to specific usage in different languages; rather, it duplicates characters only to support compatibility with base standards. The Unicode Standard does not attempt to encode features such as language, font, size, positioning, glyphs, and so forth. For example, it does not preserve language as a part of character encoding: just as French i grecque, German ypsilon, and English wye are all repre- sented by the same character code, U+0057 “Y”, so too are Chinese zi, Japanese ji, and Korean all represented as the same character code, U+5B57 . In determining whether to unify variant ideograph forms across standards, the Unicode Standard follows the principles described in Section 10.1, Han. Where these principles determine that two forms constitute a trivial (wazukana) difference, the Unicode Standard assigns a single code. Otherwise, separate codes are assigned. Compatibility Characters. Compatibility characters are those that would not have been encoded (except for compatibility) because they are in some sense variants of characters that have already been coded. The examples are the glyph variants in the Compati- bility Area: halfwidth characters, Arabic contextual form glyphs, Arabic ligatures, and so on. The Compatibility Area contains a large number of compatibility characters, but the Uni- code Standard also contains many compatibility characters that do not appear in the Com- patibility Area. Examples include , such as the IV “character.” It is important to be able to identify which characters are compatibility characters so that Uni- code-based systems can treat them in a uniform way. Identifying one character as a compatibility variant of another character implies that gen- erally the first can be remapped to the other without the loss of any information other than formatting. Such remapping cannot always take place because many of the compatibility characters are in place just to allow systems to maintain one-to-one mappings to existing code sets. In such cases, a remapping would lose information that is felt to be important in the original set. Compatibility mappings are called out in Section 14.1, Character Names List. Because replacing a character by its compatibly equivalent character or character sequence may change the information in the text, implementation must proceed with cau- tion. A good use of these mappings may not be in transcoding, but rather in providing the correct equivalence for searching and sorting.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 17 2.2 Unicode Design Principles General Structure

Dynamic Composition The Unicode Standard allows for the dynamic composition of accented forms and Hangul syllables. Combining characters used to create composite forms are productive. Because the process of character composition is open-ended, new forms with modifying marks may be created from a combination of base characters followed by combining characters. For example, the diaeresis, “¨”, may be combined with all and a number of in languages using the Latin script and several other scripts.

Equivalent Sequence Some text elements can be encoded either as static precomposed forms or by dynamic composition. Common precomposed forms such as U+00DC “Ü”       are included for compatibility with current standards. For static pre- composed forms, the standard provides a mapping to the canonically equivalent dynami- cally composed sequence of characters. (See also Section 3.6, Decomposition.) In many cases, different sequences of Unicode characters are considered equivalent. For example, a may be represented as a composed character sequence (see Figure 2-5 and Figure 2-10). Figure 2-5. Equivalent Sequences B + Ä B + A + @¨ + A L + J + A

In such cases, the Unicode Standard does not prescribe one particular sequence; all of the sequences in the examples are equivalent. Systems may choose to normalize Unicode text to one particular sequence, such as by normalizing composed character sequences into pre- composed characters, or vice versa. Therefore, any distinctions made between nonidentical equivalent sequences by applications or users are not guaranteed to be interchangeable. (For implementation guidelines, see Section 5.7, Normalization.)

Convertibility Character identity is preserved for interchange with a number of different base standards, including national, international, and vendor standards. Where variant forms (or even the same form) are given separate codes within one base standard, they are also kept separate within the Unicode Standard. This choice guarantees the existence of a mapping between the Unicode Standard and base standards. Accurate convertibility is guaranteed between the Unicode Standard and other standards in wide usage as of May 1993. In general, a single code value in another standard will corre- spond to a single code value in the Unicode Standard. Sometimes, however, a single code value in another standard corresponds to a sequence of code values in the Unicode Stan- dard, or vice versa. Conversion between Unicode text and text in other character codes must in general be done by explicit table-mapping processes. (See also Section 5.1, Transcoding to Other Standards.)

18 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.3 Encoding Forms

2.3 Encoding Forms The Unicode Standard character encoding model includes a repertoire of characters, their mapping to a set of integers, and their encoding forms. Character encoding forms specify the representation of characters as actual data in a computer. The Unicode Standard uses two encoding forms: 16-bit and 8-bit. For a formal definition of these encoding forms, see Section 3.8, Transformations. Figure 2-6 illustrates the relationship between abstract characters, encoded characters, and character encoding schemes. In Figure 2-6: • The solid connect encoded characters with the abstract characters that they represent and encode. • The dotted arrow illustrates a hypothetical private-use variant of the A-ring character (the value F000016 is the Unicode scalar value for a surrogate pair in the Private Use Area). • The hollow arrow shows where an encoded character sequence represents an abstract character, but does not directly encode it. • The serialization column shows two of the possible serializations of the encoded characters. Figure 2-6. Character Encoding Schemes Abstract Encoded Serialized

UTF-16BE UTF- 8

C5 00 C5 C3 85

212B 21 2B E2 84 AB  F0000 DB 80 DC 00 F3 B0 80 80

A 61 30A 00 61 03 0A 61 CC 8A

¡

UTF-16 The default encoding form of the Unicode Standard is 16-bit: characters are assigned a unique 16-bit value, except that characters encoded by surrogates consist of a pair of 16-bit values. The Unicode 16-bit encoding form is identical to the ISO/IEC 10646 transforma- tion format UTF-16. In UTF-16, characters mapped up to 65535 are encoded as single 16- bit values; characters mapped above 65535 are encoded as pairs of 16-bit values (surro- gates). Surrogates are defined in Section 3.7, Surrogates.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 19 2.3 Encoding Forms General Structure

By using surrogates, UTF-16 provides a coded representation for more than 1 million graphic characters in a form that is compatible with 16-bit characters. This scheme permits the coexistence of a very large number of characters in systems that define characters as 16- bit entities. Surrogates were designed to be a simple and efficient extension mechanism that works well with older implementations and avoids many of the problems of multibyte encodings. See Section 5.4, Handling Surrogate Pairs, for more information about surrogate characters.

UTF-8 To meet the requirements of byte-oriented, ASCII-based systems, a second encoding form is specified by the Unicode Standard: UTF-8. It is a variable-length encoding that preserves ASCII transparency. UTF-8 usage is fully conformant with the Unicode Standard and ISO/ IEC 10646. Existing software and practices in information technology frequently depend on character data being represented as a sequence of bytes. To use Unicode character data in such sys- tems, Unicode character values and pairs of surrogates must be transformed into a sequence of one or more bytes that represent the same information, but which are restricted in their numerical range. UTF-8 is a variable-length encoding of byte sequences, where the high bits indicate the part of the sequence to which a byte belongs. Table 3-1 on page 47 shows how the bits in a Uni- code value (or a surrogate pair) are distributed among the bytes in the UTF-8 encoding. Any byte sequence that does not follow this pattern is an ill-formed byte sequence. See Section 3.8, Transformations, for a definition of ill-formed byte sequences. The UTF-8 encoding form maintains transparency for all of the ASCII code values (0..127). Further- more, the values 0..127 do not appear in any byte of a transformed result except as the direct representation of these ASCII values. The ASCII range (U+0000..U+007F) is repre- sented by single bytes; many of the non-ideographic scripts are represented by two bytes; the remaining Unicode scalar values are represented as three bytes; and surrogate pairs require four bytes. Other important characteristics of UTF-8 are as follows: • It is efficient to convert to and from 16-bit Unicode text. • The first byte indicates the number of bytes to follow in a multibyte sequence, allowing for efficient forward parsing. • It is efficient to find the start of a character when beginning from an arbitrary location in a byte stream. Programs need to search at most four bytes back- ward, and it is a simple task to recognize an initial byte. • UTF-8 is reasonably compact in terms of the number of bytes used for encod- ing. • A binary sort of UTF-8 strings gives the same ordering as a binary sort of Uni- code scalar values. • If surrogates are not used, a binary sort of UTF-8 strings gives the same order- ing as the binary sort of their UTF-16 counterparts. Sample code for transforming Unicode data (UTF-16) into UTF-8 is provided on the CD- ROM.

20 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.4 Unicode Allocation

Character Encoding Schemes A character encoding scheme consists of an encoding form plus byte serialization. Thus there are four character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16LE, and UTF-16BE. Often, encoding forms and character encoding schemes have the same name. The names of the Unicode encoding forms are the same as those used in other contexts. Notably, the IANA maintains a registry of charset names used on the Internet, which includes the same names as the encoding forms used here. Although the encoding forms here and the descriptions of the charsets registered with IANA are very similar, some important differences may arise in terms of the requirements for each. For more informa- tion, please see Unicode Technical Report #17, “Character Encoding Model,” on the CD- ROM or the up-to-date version on the Unicode Web site.

2.4 Unicode Allocation All code values in the Unicode Standard are equally accessible electronically; the exact assignment of character codes is of minor consequence for information processing. Never- theless, for the convenience of people who will use them, the codes are grouped by linguis- tic and functional categories. Such grouping offers the additional benefit of supporting various space-saving techniques in actual implementations.

Allocation Areas Figure 2-7 provides an overview of the Unicode codespace allocation. For convenience, the Unicode Standard codespace is divided into several areas, which are then subdivided into character blocks: • The General Scripts Area, consisting of alphabetic and syllabic scripts that have relatively small character sets, such as Latin, Cyrillic, Greek, Hebrew, Arabic, Devanagari, and Thai • The Symbols Area, including a large variety of symbols and dingbats, for punc- tuation, mathematics, chemistry, technical, and other specialized usage • The CJK Phonetics and Symbols Area, including punctuation, symbols, radicals, and phonetics for Chinese, Japanese, and Korean • The CJK Ideographs Area, consisting of 27,484 unified CJK ideographs • The Area, consisting of 1,165 syllables and 50 Yi radicals • The Hangul Syllables Area, consisting of 11,172 precomposed Korean Hangul syllables • The Surrogates Area, consisting of 1,024 low-half surrogates and 1,024 high-half surrogates that are used in the surrogate extension method to access more than 1 million codes for future expansion • The Private Use Area, containing 6,400 code positions used for defining user- or vendor-specific characters • The Compatibility and Specials Area, containing many of the characters from widely used corporate and national standards that have other representations in Unicode encoding, as well as several special-use characters The allocation of characters into areas reflects the evolution of the Unicode Standard and is not intended to define the usage of characters in implementations. For example, many

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 21 2.4 Unicode Allocation General Structure characters are included in the standard solely for reasons of compatibility with other stan- dards but are not coded in the Compatibility Area; there are many general-purpose sym- bols and punctuation in the CJK Phonetics and Symbols Area, while the Hangul conjoining jamo are in the General Scripts Area. A Private Use Area gives the Unicode Standard the necessary flexibility and matches wide- spread practice in existing standards. Successful interchange requires agreement between sender and receiver regarding interpretation of private-use codes. Figure 2-7. Unicode Allocation

U+0000 U+0000 U+0100 Latin U+1000 General U+0200 Scripts U+0300 U+2000 U+0400 Greek Cyrillic Symbols U+0500 Armenian/Hebrew U+3000 U+0600 CJK Misc. Arabic U+0700 Syriac/Thaana U+4000 U+0800 U+0900 Devanagari/Bengali U+5000 U+0A00 Gurmukhi/Gujarati U+0B00 Oriya/Tamil U+6000 CJKV U+0C00 Telugu/Kannada Ideographs U+0D00 Malayalam/Sinhala U+7000 U+0E00 Thai/Lao U+0F00 Tibetan U+8000 U+1000 Myanmar/Georgian U+1100 Hangul Jamo U+9000 U+1200 U+1300 Ethiopic U+A000 U+1400 Cherokee Yi U+1500 Canadian Aboriginal U+B000 U+1600 Syllabics U+1700 Ogham/Runic U+C000 U+1800 Khmer Mongolian Hangul U+1900 U+D000 U+1A00 U+1B00 U+E000 Surrogates U+1C00 U+1D00 U+F000 Private Use U+1E00 Extended Latin U+1F00 Extended Greek Compatibility U+2000 Primary Private Use Compatibility Reserved

22 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.4 Unicode Allocation

Codespace Assignment for Graphic Characters The predominant characteristics of codespace assignment in the Unicode Standard are as follows: • Where there is a single accepted standard for a script, the Unicode Standard generally follows it for the relative order of characters within that script. • The first 256 codes follow precisely the arrangement of ISO/IEC 8859-1 (Latin 1), of which 7-bit ASCII (ISO/IEC 646 IRV) accounts for the first 128 code positions. • Characters with common characteristics are located together contiguously. For example, the primary Arabic character block was modeled after ISO/IEC 8859- 6. The Arabic script characters used in Persian, Urdu, and other languages, but not included in ISO/IEC 8859-6, are allocated after the primary Arabic charac- ter block. Right-to-left scripts are grouped together. • To the extent possible, scripts do not cross 128-byte boundaries or 1,024-byte boundaries. • Codes that represent letters, punctuation, symbols, and diacritics that are gen- erally shared by multiple languages or scripts are grouped together in several locations. • The Unicode Standard makes no pretense to correlate character code allocation with language-dependent collation or case. • Unified CJK ideographs are laid out in two sections, each of which is arranged according to the Han ideograph arrangement defined in Section 10.1, Han. This ordering is roughly based on a radical-stroke count order.

Nongraphic Characters, Reserved and Unassigned Codes All code points except those mentioned below are reserved for graphic characters. Code points unassigned in this version of the Unicode Standard are available for assignment in later versions of the Unicode Standard to characters of any script. • Sixty-five codes (U+0000..U+001F and U+007F..U+009F) are reserved specifi- cally as control codes. Of the control codes, null (U+0000) can be used as a string terminator as in the C language, tab (U+0009) retains its customary meaning, and the others may be interpreted according to ISO/IEC 6429. (See Section 2.8, Controls and Control Sequences, and Section 13.1, Control Codes.) • Two codes are not used to encode characters: U+FFFF is reserved for internal use (as a sentinel) and should not be transmitted or stored as part of plain text. U+FFFE is also reserved; its presence indicates byte-swapped Unicode data. (See Section 13.6, Specials.) • A contiguous area of codes has been set aside for private use. Characters in this area will never be defined by the Unicode Standard. These codes can be freely used for characters of any purpose, but successful interchange requires an agreement between sender and receiver on their interpretation. • In addition, 2,048 codes have been allocated for surrogates, which are used in the extension mechanism.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 23 2.5 Writing Direction General Structure

2.5 Writing Direction Individual writing systems have different conventions for arranging characters into lines on a page or screen. Such conventions are referred to as a script’s directionality. For example, in the Latin script, characters run horizontally from left to right to form lines, and lines run from top to bottom. In Semitic scripts such as Hebrew and Arabic, characters are arranged from right to left into lines, although digits run the other way, making the scripts inherently bidirectional. Left-to-right and right-to-left scripts are frequently used together. In such a case, arranging characters into lines becomes more complex. The Unicode Standard defines an algorithm to determine the layout of a line. See Section 3.12, Bidirectional Behavior, for more informa- tion. East Asian scripts are frequently written in vertical lines that run from top to bottom. Lines are arranged from right to left, except for Mongolian. Most characters have the same shape and orientation when displayed horizontally or vertically, but many punctuation characters change their shape when displayed vertically. In a vertical context, letters and words from other scripts are generally rotated through 90-degree so that they, too, read from top to bottom. That is, letters from left-to-right scripts will be rotated clockwise and letters from right-to-left scripts will be rotated counterclockwise. In contrast to the bidirectional case, the choice to lay out text either vertically or horizon- tally is treated as a formatting style. Therefore, the Unicode Standard does not provide directionality controls to specify that choice. Other script directionalities are found in historical writing systems. For example, some ancient Numidian texts are written bottom to top, and Egyptian hieroglyphics can be writ- ten with varying directions for individual lines. Early Greek used a system called boustrophedon (literally, “ox-turning”). In boustrophedon writing, characters are arranged into horizontal lines, but the individual lines alternate between running right to left and running left to right, the way an ox goes back and forth when plowing a field. The letter images are mirrored in accordance with the direction of each individual line. Boustrophedon writing is of interest almost exclusively to scholars intent on reproducing the exact visual content of ancient texts. The Unicode Standard does not provide direct support for boustrophedon. Fixed texts can, however, be written in boustrophedon by using hard line breaks and directionality overrides.

2.6 Combining Characters Combining Characters. Characters intended to be positioned relative to an associated base character are depicted in the character code charts above, below, or through a . They are also annotated in the names list or in the character property lists as “combining” or “nonspacing” characters. When rendered, the glyphs that depict these characters are intended to be positioned relative to the glyph depicting the preceding base character in some combination. The Unicode Standard distinguishes two types of combining charac- ters: spacing and nonspacing. Nonspacing combining characters do not occupy a spacing position by themselves. In rendering, the combination of a base character and a nonspacing character may have a different advance width than the base character by itself. For example, an “Ó” may be slightly wider than a plain “i”. The spacing or nonspacing properties of a are defined in the Unicode Character Database.

24 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.6 Combining Characters

Diacritics. Diacritics are the principal class of nonspacing combining characters used with European . In the Unicode Standard, the term “diacritic” is defined very broadly to include accents as well as other nonspacing marks. All diacritics can be applied to any base character and are available for use with any script. A separate block is provided for symbol diacritics, generally intended to be used with sym- bol base characters. The blocks contain additional combining characters for particular scripts with which they are primarily used. As with other characters, the allocation of a combining character to one block or another identifies only its primary usage; it is not intended to define or limit the range of characters to which it may be applied. In the Uni- code Standard, all sequences of character codes are permitted. Other Combining Characters. Some scripts, such as Hebrew, Arabic, and the scripts of India and Southeast Asia, have spacing or nonspacing combining characters. Many of these combining marks encode vowel letters; as such, they are not generally referred to as “dia- critics.”

Sequence of Base Characters and Diacritics In the Unicode Standard, all combining characters are to be used in sequence following the base characters to which they apply. The sequence of Unicode characters U+0061 “a”     + U+0308 “þ”í   + U+0075 “u”     unambiguously encodes “äu” not “aü”. The ordering convention used by the Unicode Standard is consistent with the logical order of combining characters in Semitic and Indic scripts, the great majority of which (logically or phonetically) follow the base characters with respect to which they are positioned. This convention conforms to the way modern font technology handles the rendering of non- spacing graphical forms (glyphs) so that mapping from character memory representation order to font rendering order is simplified. It is different from the convention used in the bibliographic standard ISO 5426. A sequence of a base character plus one or more combining characters generally has the same properties as the base character. For example, “A” followed by “ ”ó has the same prop- erties as “”. In some contexts, enclosing diacritics confer a symbol property to the charac- ters they enclose. This idea is discussed more fully in Section 3.10, Canonical Ordering Behavior, but see also Section 3.12, Bidirectional Behavior. In the charts for Indic scripts, some vowels are depicted to the left of dotted circles (see Figure 2-8). This special case must be carefully distinguished from that of general combin- ing diacritical mark characters. Such vowel signs are rendered to the left of a let- ter or , even though their logical order in the Unicode encoding follows the consonant letter. The coding of these vowels in pronunciation order and not in visual order is consistent with the ISCII standard. Figure 2-8. Indic Vowel Signs Ó +ç@ çÓ

Multiple Combining Characters In some instances, more than one diacritical mark is applied to a single base character (see Figure 2-9). The Unicode Standard does not restrict the number of combining characters

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 25 2.6 Combining Characters General Structure that may follow a base character. The following discussion summarizes the treatment of multiple combining characters. (For the formal algorithm, see Chapter 3, Conformance.) Figure 2-9. Stacking Sequences

2 Characters Glyphs a @¬÷@ @. @ ä÷. ö ö If the combining characters can interact typographically—for example, a U+0304 -   and a U+0308  —then the order of graphic display is determined by the order of coded characters (see Figure 2-10). The diacritics or other com- bining characters are positioned from the base character’s glyph outward. Combining char- acters placed above a base character will be stacked vertically, starting with the first encountered in the logical store and continuing for as many marks above as are required by the character codes following the base character. For combining characters placed below a base character, the situation is reversed, with the combining characters starting from the base character and stacking downward. Figure 2-10. Interacting Combining Characters

LATIN SMALL LETTER A WITH ‹ LATIN SMALL LETTER A + COMBINING TILDE á a LATIN SMALL LETTER A + COMBINING ABOVE

LATIN SMALL LETTER + COMBINING DOT BELOW ‹. LATIN SMALL LETTER A + COMBINING TILDE + COMBINING DOT BELOW LATIN SMALL LETTER A + COMBINING DOT BELOW + COMBINING TILDE á a LATIN SMALL LETTER A + COMBINING DOT BELOW + COMBINING DOT ABOVE . LATIN SMALL LETTER A + COMBINING DOT ABOVE + COMBINING DOT BELOW

LATIN SMALL LETTER A WITH AND ACUTE « LATIN SMALL LETTER A WITH CIRCUMFLEX + COMBINING ACUTE ‰ LATIN SMALL LETTER A + COMBINING CIRCUMFLEX + COMBINING ACUTE

ö LATIN SMALL LETTER A ACUTE + COMBINING CIRCUMFLEX ‡ LATIN SMALL LETTER A + COMBINING ACUTE + COMBINING CIRCUMFLEX

An example of multiple combining characters above the base character is found in Thai, where a consonant letter can have above it one of the vowels U+0E34 through U+0E37 and, above that, one of four marks U+0E48 through U+0E4B. The order of character codes that produces this graphic display is base consonant character + vowel character + tone mark character. Some specific combining characters override the default stacking behavior by being posi- tioned horizontally rather than stacking or by with an adjacent nonspacing mark (see Figure 2-11). When positioned horizontally, the order of codes is reflected by

26 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.6 Combining Characters positioning in the predominant direction of the script with which the codes are used. For example, in a left-to-right script, horizontal accents would be coded left to right. In Figure 2-11, the top example is correct and the bottom example is incorrect. Figure 2-11. Nondefault Stacking

GREEK SMALL LETTER This is αÕ« + COMBINING COMMA ABOVE (psili) + COMBINING (oxia) correct GREEK SMALL LETTER ALPHA This is Õ + COMBINING ACUTE ACCENT (oxia) α« + COMBINING COMMA ABOVE (psili) incorrect

Prominent characters that show such override behavior are associated with specific scripts or alphabets. For example, when used with the Greek script, the “breathing marks” U+0313    (psili) and U+0314     (dasia) require that, when used together with a following acute or , they be rendered side-by-side above their base letter rather than the accent marks being stacked above the breathing marks. The order of codes here is base character code + breathing mark code + accent mark code. This example demonstrates the script-dependent nature of ren- dering combining diacritical marks.

Multiple Base Characters When the glyphs representing two base characters merge to form a ligature, then the com- bining characters must be rendered correctly in relation to the ligated glyph (see Figure 2-12). Internally, the software must distinguish between the nonspacing marks that apply to positions relative to the first part of the ligature glyph and those that apply to the second. (For a discussion of general methods of positioning nonspacing marks, see Section 5.13, Strategies for Handling Nonspacing Marks.) Figure 2-12. Multiple Base Characters @ f @÷ i . fi˜.

Multiple base characters do not commonly occur in most scripts. However, in some scripts, such as Arabic, this situation occurs quite often when vowel marks are used. It arises because of the large number of ligatures in Arabic, where each element of a ligature is a consonant, which in turn can have a vowel mark attached to it. Ligatures can even occur with three or more characters merging; vowel marks may be attached to each part.

Spacing Clones of European Diacritical Marks By convention, diacritical marks used by the Unicode Standard may be exhibited in (appar- ent) isolation by applying them to U+0020  or to U+00A0   . This tac- tic might be employed, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text. The Unicode Standard separately encodes clones of many common European diacritical marks that are spacing characters, largely to provide compatibility with existing character set standards. These related charac- ters are cross-referenced in the names list in Chapter 14, Code Charts.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 27 2.7 Special Character and Noncharacter Values General Structure

2.7 Special Character and Noncharacter Values The Unicode Standard includes a small number of important characters with special behavior; some of them are introduced in this section. It is important that implementa- tions treat these characters properly. For a list of these and similar characters, see Section 3.9, Special Character Properties; for more information about such characters, see Section 13.1, Control Codes, Section 13.2, Layout Controls, and Section 13.6, Specials.

Byte Order Mark (BOM) The canonical encoding form of Unicode plain text as a sequence of 16-bit codes is sensitive to the byte ordering that is used when serializing text into a sequence of bytes, such as when writing to a file or transferring across a network. Some processors place the least significant byte in the initial position; others place the most significant byte in the initial position. Ide- ally, all implementations of the Unicode Standard would follow only one set of byte order rules, but this scheme would force one class of processors to swap the byte order on reading and writing plain text files, even when the file never leaves the system on which it was created. To have an efficient way to indicate which byte order is used in a text, the Unicode Standard contains two code values, U+FEFF   -  () and U+FFFE (not a character code), which are the byte-ordered mirror images of one another. The byte order mark is not a that selects the byte order of the text; rather, its function is to notify recipients as to which byte ordering is used in a file. Unicode Signature. An initial BOM may also serve as an implicit marker to identify a file as containing Unicode text. The sequence FE16 FF16 (or its byte-reversed counterpart, FF16 FE16) is exceedingly rare at the outset of text files that use other character encodings. It is therefore not likely to be confused with real text data. The same is true for both single-byte and multibyte encodings. Data streams (or files) that begin with U+FEFF byte order mark are likely to contain Uni- code values. It is recommended that applications sending or receiving untyped data streams of coded characters use this signature. If other signaling methods are used, signa- tures should not be employed. Conformance to the Unicode Standard does not requires the use of the BOM as such a sig- nature. See Section 13.6, Specials, for more information on byte order mark and its use as an encoding signature.

Special Noncharacter Values U+FFFF and U+FFFE. These code values are not used to represent Unicode characters. U+FFFF is reserved for private program use as a sentinel or other signal. (Notice that U+FFFF is a 16-bit representation of –1 in two’s-complement notation.) Programs receiv- ing this code are not required to interpret it in any way. It is good practice, however, to rec- ognize this code as a noncharacter value and to take appropriate action, such as by indicating possible corruption of the text. U+FFFE is similar in all respects to U+FFFF, except that it is also the mirror image of U+FEFF   -  (byte order mark). The presence of U+FFFE constitutes a strong hint that the text in question is byte- reversed. See Section 13.6, Specials.

28 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.8 Controls and Control Sequences

Separators Line and Paragraph Separators. For historical reasons, and line feed are not used consistently across different systems. The Unicode Standard provides (and encourages the use of) the line and paragraph separator characters to provide clear informa- tion about where line and paragraph boundaries occur. Paragraph separation is also required for use with the bidirectional algorithm (see Chapter 3, Conformance). As these entities are separator codes, it is not necessary to start the first line or paragraph or to end the last line or paragraph with them. Also see Section 13.2, Layout Controls. Interaction with CR/LF. The Unicode Standard does not prescribe specific semantics for U+000D   (CR) and U+000A   (LF). These codes are provided to represent any CR or LF characters employed by a higher-level protocol or retained in text translated from other standards. It is left to each application to interpret these codes, to decide whether to require their use, and to determine whether CR/LF pairs or single codes are needed. See also Section 5.9, Line Handling, and Unicode Technical Report #13, “Uni- code Newline Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.

Layout and Format Control Characters The Unicode Standard defines several characters that are used to control joining behavior, bidirectional ordering control, and alternative formats for display. These characters are explicitly defined as not affecting line-breaking behavior. Unlike space characters or other delimiters, they do not serve to indicate word, line, or other unit boundaries. Their specific use in layout and formatting is described in Section 13.2, Layout Controls.

The Replacement Character U+FFFD   is the general in the Unicode Standard. It can be substituted for any “unknown” character in another encoding that can- not be mapped in terms of known Unicode values (see Section 5.3, Unknown and Missing Characters, and Section 13.6, Specials).

2.8 Controls and Control Sequences

Control Characters The Unicode Standard provides 65 code values for the representation of control characters. These ranges are U+0000..U+001F and U+007F..U+009F, which correspond to the 8-bit controls 0016 to 1F16 (C0 controls) and 7F16 to 9F16 (delete and C1 controls). For example, the 8-bit version of horizontal tab (HT) is at 0916; the Unicode Standard encodes tab at U+0009. When converting control codes from existing 8-bit text, they are merely zero- extended to the full 16 bits of Unicode characters. Programs that conform to the Unicode Standard may treat these 16-bit control codes in exactly the same way as they treat their 7- and 8-bit equivalents in other protocols, such as ISO/IEC 2022 and ISO/IEC 6429. Such usage constitutes a higher-level protocol and is beyond the scope of the Unicode Standard. Similarly, the use of ISO/IEC 6429:1992 control sequences (extended to 16 bits) for controlling bidirectional formatting is a legitimate higher-level protocol layered on top of the plain text of the Unicode Standard. As with all higher-level protocols, both the sender and the receiver must agree upon a common proto- col beforehand.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 29 2.9 Conforming to the Unicode Standard General Structure

Escape Sequences. In converting text containing escape sequences to the Unicode character encoding, text must be converted to the equivalent Unicode characters. Converting escape sequences into Unicode characters on a character-by-character basis (for instance, ESC–A turns into U+001B , U+0041    ) allows the reverse conver- sion to be performed without forcing the conversion program to recognize the escape sequence as such. Control Code Sequences Encoding Additional Information about Text. If a system uses sequences beginning with control codes to embed additional information about text (such as formatting attributes or structure), then such sequences form a higher-level protocol outside the scope of the Unicode Standard. Such higher-level protocols are not specified by the Unicode Standard; their existence cannot be assumed without a separate agreement between the parties interchanging such data.

Representing Control Sequences Control sequences can be represented in the Unicode encoding design but must then be represented in terms of 16-bit characters. For example, suppose that an application allows embedded font information to be transmitted by means of an 8-bit sequence. In the follow- ing, the notation ^A refers to the C0 control code 0116, ^B refers to the C0 control code 0216, and so on: ^ATimes^B = 01,54,69,6D,65,73,02 Then the corresponding sequence of Unicode character codes would be ^ATimes^B = 0001,0054,0069,006D,0065,0073,0002 That is, each Unicode character code is a 16-bit zero-extended code value of the corre- sponding 8-bit code value. Where the embedded data are not interpreted as a sequence of characters by the protocol, the information could be encoded as follows: ^ATimes^B = 0001,5469,6D65,7300,0002 The data could never be encoded as ^ATimes^B = 0154,696D,6573,0200 because in the Unicode character encoding this sequence represents four characters—      (U+0154), two Han characters (U+696D and U+6573, respectively), and        (U+0200). None of these characters is a control character. If a control sequence contains embedded binary data, then the data bytes do not necessarily need to be zero-extended because the control sequence constitutes a higher protocol. However, doing so allows code conversion algorithms to suc- ceed even in the absence of explicit knowledge of employed control sequences.

2.9 Conforming to the Unicode Standard Chapter 3, Conformance, specifies the set of unambiguous criteria to which a Unicode- conformant implementation must adhere so that it can interoperate with other confor- mant implementations. The following section gives examples of conformance and non- conformance to complement the formal statement of conformance. An implementation that conforms to the Unicode Standard has the following characteris- tics:

30 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.9 Conforming to the Unicode Standard

• It treats characters as 16-bit units.  U+2020 (that is, 202016) is the single Unicode character ‘†’, not two ASCII spaces. • It interprets characters according to the identities, properties, and rules defined for them in this standard. U+2423 is ‘ˆ’  , not ‘¨’ hiragana small i (which is the meaning of the bytes 242316 in JIS). U+00D4 ‘ô’ is equivalent to U+004F ‘o’ followed by U+0302 ‘þ’ó, but not equivalent to U+0302 followed by U+004F. U+05D0 ‘µ’ followed by U+05D1 ‘¶’ looks like ‘¶µ’, not ‘µ¶’ when dis- played. When an implementation supports Arabic or Hebrew characters and displays those characters, they must be ordered according to the bidirec- tional algorithm described in Section 3.12, Bidirectional Behavior. When an implementation supports Arabic, Devanagari, Tamil, or other shaping characters and displays those characters, at a minimum the characters are shaped according to the appropriate character block descriptions given in Section 8.2, Arabic, Section 9.1, Devanagari, or Section 9.6, Tamil. (More sophisticated shaping can be used if available.) • It does not use unassigned codes. U+2073 is unassigned and not usable for ‘3’ (superscript 3) or any other character. • It does not corrupt unknown characters. U+2029 is   and should not be dropped by appli- cations that do not yet support it. U+03A1 “P”     should not be changed to U+00A1 (first byte dropped), U+0050 (mapped to Latin letter P), U+A103 (bytes reversed), or anything other than U+03A1. However, it is acceptable for a conforming implementation: • To support only a subset of the Unicode characters. An application might not provide mathematical symbols or the , for example. • To transform data knowingly. Uppercase conversion: ‘a’ transformed to ‘A’ Romaji to kana: ‘kyo’ transformed to ²ø U+247D ‘(10)’ decomposed to 0028 0031 0030 0029 • To build higher-level protocols on the character set. Compression of characters Use of rich text file formats • To define characters in the Private Use Area.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 31 2.10 Referencing Versions of the Unicode Standard General Structure

Examples of characters that might be encoded in the Private Use Area include supplementary ideographic characters (gaiji) or existing corpo- rate logo characters. • To not support the bidirectional algorithm or character shaping in implemen- tations that do not support complex scripts, such as Arabic and Devanagari. • To not support the bidirectional algorithm or character shaping in implemen- tations that do not display characters, such as on servers or in programs that simply parse or transcode text, such as an XML parser. Code conversion from other standards to the Unicode Standard will be considered confor- mant if the matching table produces accurate conversions in both directions.

Characters Not Used in a Subset The Unicode Standard does not require that an application be capable of interpreting and rendering all Unicode characters so as to be conformant. Many systems will have fonts only for some scripts, but not for others; sorting and other text-processing rules may be imple- mented only for a limited set of languages. As a result, an implementation is able to inter- pret a subset of characters. The Unicode Standard provides no formalized method for identifying this subset. Further- more, this subset is typically different for different aspects of an implementation. For example, an application may be able to read, write, and store any 16-bit character, and to sort one subset according to the rules of one or more languages (and the rest arbitrarily), but have access only to fonts for a single script. The same implementation may be able to render additional scripts as soon as additional fonts are installed in its environment. There- fore, the subset of interpretable characters is typically not a static concept. Conformance to the Unicode Standard implies that whenever text purports to be unmodi- fied, uninterpretable characters must not be removed or altered. (See also Section 3.1, Con- formance Requirements.)

2.10 Referencing Versions of the Unicode Standard For most character encodings, the character repertoire is fixed (and often small). Once the repertoire is decided upon, it is never changed. Addition of a new abstract character to a given repertoire is conceived of as creating a new repertoire, which then will be given its own catalog number, constituting a new object. For the Unicode Standard, on the other hand, the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is potentially a member of the actual set to be encoded, regardless of whether the character is currently known. Each new version of the Unicode Standard replaces the previous one and makes it obsolete, but implementations—and more significantly, data—are not updated instantly. In general, major and minor version changes include new characters, which do not create particular problems with old data. The Unicode Technical Committee will neither remove nor move characters, but characters may be deprecated. This approach does not remove them from the standard or from existing data. The will never be used for a different charac- ter, but its use is strongly discouraged. Implementations should be prepared to be forward-compatible with respect to Unicode versions. That is, they should accept text that may be expressed in future versions of this

32 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.10 Referencing Versions of the Unicode Standard standard, recognizing that new characters may be assigned in those versions. Thus they should handle incoming unassigned code points as they do unsupported characters. (See Section 5.3, Unknown and Missing Characters.) A version change may also involve changes to the properties of existing characters. When this situation occurs, modifications are made to UnicodeData.txt and other relevant con- tributing data files, and a new update version is issued for the standard. Changes to the data files may alter program behavior that depends on them. Version Numbering. Version numbers for the standard consist of three fields: the major version, the minor version, and the update version. The differences among them are as fol- lows: • Major—significant additions to the standard, published as a book. • Minor—character additions or more significant normative changes, published as a technical report on the Unicode Web site. • Update—any other changes to normative or important informative portions of the standard that could change program behavior. These changes are reflected in a new UnicodeData.txt file and other contributing data files of the Unicode Character Database. For additional information on the current and past versions of the Unicode standard, see http://www.unicode.org/unicode/standard/versions/ on the Unicode Web site.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 33 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 3 Conformance 3

This chapter defines conformance to the Unicode Standard in terms of the principles and encoding architecture it embodies. The first section consists of the conformance clauses, followed by sections that define more precisely the technical terms used in those clauses. The remaining sections contain the formal algorithms that are part of conformance and referred to by the conformance clause. These algorithms specify the required results rather than the specific implementation; all implementations that produce results identical to the results of the formal algorithms are conformant. In this chapter, conformance subclauses are identified with a letter C. Definitions are iden- tified with the letter D. Bulleted items are explanatory comments regarding definitions or subclauses. Except for Section 3.12, Bidirectional Behavior, the numbering of rules and definitions matches that of The Unicode Standard, Version 2.0. Where new rules and definitions were added, letters are used with numbers—for example, D7a.

3.1 Conformance Requirements This section specifies the formal conformance requirements for processes implementing Version 3.0 of the Unicode Standard. that this clause has been revised from the previ- ous versions of the Unicode Standard. These revisions do not change the substance of the conformance requirements previously set forth, but rather are formalized and extended to allow for the use of transformation formats. Implementations that satisfied the confor- mance clause of the previous versions of the Unicode Standard will satisfy this revised clause.

Byte Ordering C1 A process shall interpret Unicode code values as 16-bit quantities. • Unicode values can be stored in native 16-bit machine words. • For information on use of wchar_t or other programming language types to rep- resent Unicode values, see Section 5.2, ANSI/ISO C wchar_t. C2 The Unicode Standard does not specify any order of bytes inside a Unicode value. • Machine architectures differ in ordering in terms of whether the most significant byte or the least significant byte comes first. These sequences are known as “big- endian” and “little-endian” orders, respectively. C3 A process shall interpret a Unicode value that has been serialized into a sequence of bytes by most significant byte first, in the absence of higher-level protocols. • The majority of all interchange occurs with processes running on the same or a sim- ilar configuration. As a result, intradomain interchange of Unicode text in the

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 37 3.1 Conformance Requirements Conformance

domain-specific byte order is fully conformant and limits the role of the canonical byte order to interchange of Unicode text across domain, or where the nature of the originating domain is unknown. (For a discussion of the use of byte order mark to indicate byte orderings, see Section 2.7, Special Character and Noncharacter Values.)

Invalid Code Values C4 A process shall not interpret an unpaired high- or low-surrogate as an abstract character. C5 A process shall not interpret either U+FFFE or U+FFFF as an abstract character. C6 A process shall not interpret any unassigned code value as an abstract character. • These clauses do not preclude the assignment of certain generic semantics (for example, rendering with a glyph to indicate the character block) that allow for graceful behavior in the presence of code values that are outside a supported subset or code values that are unpaired surrogates. • Private-use code values are assigned, but can be given any interpretation by confor- mant processes.

Interpretation C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded charac- ter representation. • This restriction does not preclude internal transformations that are never visible external to the process. C8 A process shall not assume that it is required to interpret any particular coded character representation. • Any means for specifying a subset of characters that a process can interpret is out- side the scope of this standard. • The semantics of a code value in the Private Use Area is outside the scope of this standard. • Although these clauses are not intended to preclude enumerations or specifications of the characters that a process or system is able to interpret, they do separate sup- ported subset enumerations from the question of conformance. In real life, any sys- tem may occasionally receive an unfamiliar character code that it is unable to interpret. C9 A process shall not assume that the interpretations of two canonical-equivalent charac- ter sequences are distinct. • Ideally, an implementation would always interpret two canonical-equivalent char- acter sequences identically. There are practical circumstances under which imple- mentations may reasonably distinguish them. • Even processes that normally do not distinguish between canonical-equivalent character sequences can have reasonable exception behavior. Some examples of this behavior include graceful fallback processing by processes unable to support correct positioning of nonspacing marks; “Show Hidden Text” modes that reveal memory representation structure; and the choice of ignoring collating behavior of

38 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.1 Conformance Requirements

combining sequences that are not part of the repertoire of a specified language (see Section 5.13, Strategies for Handling Nonspacing Marks).

Modification C10 A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences, if that process purports not to modify the interpretation of that coded character represen- tation. • Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text. • Replacement or deletion of a character sequence that the process cannot or does not interpret does modify the interpretation of the text. • Changing the bit or byte ordering when transforming between different machine architectures does not modify the interpretation of the text. • Transforming to a different encoding form does not modify the interpretation of the text.

Transformations C11 When a process interprets a byte sequence in a Unicode Transformation Format, it shall interpret that byte sequence in accordance with the character semantics established by this standard for the corresponding Unicode character sequence. C12 When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition.

Bidirectional Text C13 A process that displays text containing supported right-to-left characters or embedding codes shall display all visible representations of characters (excluding format characters) in the same order as if the bidirectional algorithm had been applied to the text, in the absence of higher-level protocols (see Section 3.12, Bidirectional Behavior).

Unicode Technical Reports The following technical reports are approved and considered part of Version 3.0 of the Uni- code Standard. These reports may contain either normative or informative material, or both. Any reference to Version 3.0 of the standard automatically includes these technical reports. • UTR #11: East Asian Width, Version 5.0 • UTR #13: Unicode Newline Guidelines, Version 5.0 • UTR #14: Line Breaking Properties, Version 6.0 • UTR #15: Unicode Normalization Forms, Version 18.0

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 39 3.2 Semantics Conformance

3.2 Semantics This and the following sections more precisely define the terms that are used in the con- formance clauses. D1 Normative properties and behavior: The following are normative character properties and normative behavior of the Unicode Standard: 1. Simple properties 2. Character combination 3. Canonical decomposition 4. Compatibility decomposition 5. Surrogate property 6. Canonical ordering behavior 7. Bidirectional behavior, as interpreted according to the Unicode bidirectional algorithm 8. Conjoining jamo behavior, as interpreted according to Section 3.11, Conjoining Jamo Behavior D2 Character semantics: The semantics of a character are established by its character name, representative glyph, and normative properties and behavior. • A character may have a broader range of use than the most literal interpretation of its name might indicate; the coded representation, name, and representative glyph need to be taken in context when establishing the semantics of a character. For example, U+002E   can represent a sentence period, an abbreviation period, a decimal number separator in English, a thousands number separator in German, and so on. • Consistency with the representative glyph does not require that the images be iden- tical or even graphically similar; rather, it means that both images are generally rec- ognized to be representations of the same character. Representing the character U+0061     by the glyph “X” would violate its character iden- tity. • Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics. • The character combination properties and the canonical ordering behavior cannot be overridden by higher-level protocols.

3.3 Characters and Coded Representations D3 Abstract character: a unit of information used for the organization, control, or repre- sentation of textual data. • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, numeric, aural, or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.

40 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.3 Characters and Coded Representations

• An abstract character has no concrete form and should not be confused with a glyph. • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme. • The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters. • Abstract characters not directly encoded by the Unicode Standard can be repre- sented by the use of combining character sequences. D4 Abstract character sequence: an ordered sequence of abstract characters. Each Unicode abstract character is encoded one or more times. Each encoding, which con- sists of the relationship between an abstract character and its scalar value (see D28), is called an encoded character. Each scalar value is represented in either of two ways: as a single code value or as a sequence of two surrogate code values. D5 Code value: the minimal bit combination that can represent a unit of encoded text for processing or interchange. • Other character encoding standards typically use code values defined as 8-bit units. The same is true for the UTF-8 transformation format of the Unicode Standard. However, the code values used in the UTF-16 form of the Unicode Standard are 16- bit units. These 16-bit code values are also known simply as Unicode values. • A code value is also referred to as a code unit in the information industry. • A single abstract character may correspond to more than one code value—for example, U+00C5 Å       and U+212B Å  . • Multiple code values may be required to represent a single abstract character. For example, a byte is the code unit in SJIS: characters such as “a” can be represented with a single code value, whereas ideographs require two code values. • In some encodings, specific code units cannot be used to represent an encoded character in isolation. This restriction includes a single surrogate (high or low) in Unicode, the bytes 80–FF in UTF-8, or the bytes 81–9F, E0– in SJIS. D6 Coded character representation: an ordered sequence of one or more code values that is associated with an abstract character in a given character repertoire. • A Unicode abstract character is generally encoded by a single Unicode code value; the only exception involves surrogate pairs (which are provided for future exten- sion, but are not currently used to represent any abstract characters). D7 Coded character sequence: an ordered sequence of coded character representations. Unless specified otherwise for clarity, in the text of the Unicode Standard the term character alone generally designates a coded character representation. Similarly, the term character sequence alone generally designates a coded character sequence. D7a Deprecated character: a coded character whose use is strongly discouraged. Such characters are retained in the standard, but should not be used. • Deprecated characters are retained in the standard so that previously conforming data stay conformant in future versions of the standard. Deprecated characters are to be distinguished from obsolete characters. • Obsolete characters are historical. They do not occur in modern text, but they are not deprecated; their use is not discouraged.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 41 3.4 Simple Properties Conformance

D8 Higher-level protocol: any agreement on the interpretation of Unicode characters that extends beyond the scope of this standard. Such an agreement need not be for- mally announced in data; it may be implicit in the context.

3.4 Simple Properties The Unicode Standard, Version 3.0, defines the normative simple character properties of case, numeric value, directionality, and mirrored. Chapter 4, Character Properties, contains explicit mappings of characters to character properties. These mappings represent the default properties for conformant processes in the absence of explicit, overriding, higher- level protocols. Additional properties that are specific to particular characters (such as the definition and use of the right-to-left override character or zero-width spaces) are discussed in the relevant sections of this standard. • The Unicode Character Database contains additional properties, such as category and case mappings, that are informative rather than normative. The interpretation of some properties (such as the case of a character) is independent of context, whereas the interpretation of others (such as directionality) is applicable to a char- acter sequence as a whole, rather than to the individual characters that compose the sequence. D9 Directionality property: a property of every that determines its horizontal ordering as specified in Section 3.12, Bidirectional Behavior. • Interpretation of directional properties according to the Unicode bidirectional algo- rithm is needed for layout of right-to-left scripts such as Arabic and Hebrew. D10 Mirrored property: the property of characters whose images are mirrored horizon- tally in text that is laid out from right to left (versus left to right). (See Section 4.7, Mirrored—Normative.) • In other words, U+0028   is interpreted as an opening parenthesis; in a left-to-right context, this character will appear as “(”; in a right-to-left context, it will be mirrored and appear as “)”. • This is the default behavior in Unicode text. (For more information, see the “Semantics of Paired Punctuation” subsection in Section 6.1, General Punctuation.) D10a Case property: a property of characters in certain alphabets whereby certain charac- ters are considered variants of a single letter. (See Section 4.1, Case—Normative.) D10b Numeric value property: a property of characters used to represent numbers. (See Section 4.6, Numeric Value—Normative.) D11 Special character properties: The behavior of most characters does not require special attention in this standard. However, certain characters exhibit special behavior, which is described in the character block descriptions. These characters are listed in Section 3.9, Special Character Properties. D12 Private use: Unicode values from U+E000 to U+F8FF and surrogate pairs (see Section 3.7, Surrogates) whose high-surrogate is from U+DB80 to U+DBFF are available for private use.

42 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.5 Combination

3.5 Combination D13 Base character: a character that does not graphically combine with preceding charac- ters, and that is neither a control nor a format character. • Most Unicode characters are base characters. This sense of graphic combination does not preclude the presentation of base characters from adopting different con- textual forms or participating in ligatures. D14 Combining character: a character that graphically combines with a preceding base character. The combining character is said to apply to that base character. • These characters are not used in isolation (unless they are being described). They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. • Even though a combining character is intended to be presented in graphical combi- nation with a base character, circumstances may arise where either (1) no base char- acter precedes the combining character or (2) a process is unable to perform graphical combination. In both cases, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character. • The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle. • Combining characters generally take on the properties of their base character, while retaining their combining property. • Control and format characters, such as tab or right-left mark, are not base charac- ters. Combining characters do not apply to them. D15 Nonspacing mark: a combining character whose positioning in presentation is dependent on its base character. It generally does not consume space along the visual baseline in and of itself. • Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an “i” may affect spacing (“î”), as might the character U+20DD -   . D16 Spacing mark: a combining character that is not a nonspacing mark. • Examples include U+093F    . In general, the behavior of spacing marks does not differ greatly from that of base characters. D17 Combining character sequence: a character sequence consisting of either a base char- acter followed by a sequence of one or more combining characters, or a sequence of one or more combining characters. • A combining character sequence is also referred to as a composite character sequence. D17a Defective combining character sequence: a combining character sequence that does not start with a base character. • Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 43 3.6 Decomposition Conformance

3.6 Decomposition D18 Decomposable character: a character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the names list of Section 14.1, Character Names List. It may also be known as a precomposed charac- ter or composite character. D19 Decomposition: a sequence of one or more characters that is equivalent to a decom- posable character. A full decomposition of a character sequence results from decom- posing each of the characters in the sequence until no characters can be further decomposed.

Compatibility Decomposition D20 Compatibility decomposition: the decomposition of a character that results from recursively applying both the compatibility mappings and the canonical mappings found in the names list of Section 14.1, Character Names List, and those described in Section 3.11, Conjoining Jamo Behavior, until no characters can be further decom- posed, and then reordering nonspacing marks according to Section 3.10, Canonical Ordering Behavior. • A compatibility decomposition may remove formatting information. D21 Compatibility character: a character that has a compatibility decomposition. • Compatibility characters are included in the Unicode Standard to represent distinc- tions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data. • Replacing a compatibility character by its decomposition may lose round-trip con- vertibility with a base standard. D22 Compatibility equivalent: Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Canonical Decomposition D23 Canonical decomposition: the decomposition of a character that results from recur- sively applying the canonical mappings found in the names list of Section 14.1, Character Names List, and those described in Section 3.11, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.10, Canonical Ordering Behavior. • The canonical mappings are a subset of the compatibility mappings; however, a canonical decomposition does not remove formatting information. D24 Canonical equivalent: Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical. • For example, the sequences and <ö> are canonical equiva- lents. Canonical equivalence is a Unicode property. It should not be confused with language-specific collation or matching, which may add additional equivalencies. For example, in Swedish, ö is treated as a completely different letter from o, collated after z. In German, ö is weakly equivalent to and collated with oe. In English, ö is just an o with a diacritic that indicates that it is pronounced separately from the pre- vious letter (as in coöperate) and is collated with o.

44 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.7 Surrogates

Note: For definitions of canonical composition and compatibility composition, see Unicode Technical Report #15, “Unicode Normalization Forms,” on the CD-ROM or the up-to-date version on the Unicode Web site.

3.7 Surrogates D25 High-surrogate: a Unicode code value in the range U+D800 through U+DBFF. D26 Low-surrogate: a Unicode code value in the range U+DC00 through U+DFFF. D27 Surrogate pair: a coded character representation for a single abstract character that consists of a sequence of two Unicode values, where the first value of the pair is a high-surrogate and the second is a low-surrogate. • Unlike combining characters, which have independent semantics and properties, high- and low-surrogates have no interpretation when they do not appear as part of a surrogate pair. • Surrogate pairs are designed to allow representation of characters in future exten- sions of the Unicode Standard. There are no such currently assigned characters in this version of the standard, but it is widely expected that such characters will be added in the not too distant future. For more information, see Section 13.4, Surro- gates Area, and Section 5.4, Handling Surrogate Pairs.

D28 Unicode scalar value: a number from 0 to 10FFFF16 defined by applying the fol- lowing algorithm to a character sequence S:

N = U If S is a single, nonsurrogate value

N = (H – D80016) * 40016 + If S is a surrogate pair (L – DC0016) + 1000016

• Unicode scalar values are defined for use by standards such as SGML, XML, and HTML, which require a scalar value associated with abstract characters. • This algorithm is identical with the ISO/IEC 10646 algorithm used to transform UTF-16 into UCS-4 (for more information, see Appendix C, Relationship to ISO/ IEC 10646). • The reverse mapping from the Unicode scalar value to a surrogate pair is given by

H = (S – 1000016) / 40016 + D80016

L = (S – 1000016) % 40016 + DC0016 The operators “/” and “%” are as defined in Section 0.2, Notational Conventions.

• A Unicode scalar value is also referred to as a code position or a code point in the information industry.

3.8 Transformations More than one representation of Unicode data can be conformant to the Unicode Stan- dard. Chief among them is UTF-8, discussed in Section 2.3, Encoding Forms, and Appendix C.3, UCS Transformation Formats. In addition, there are compression transfor- mations such as the one described in the Unicode Technical Report #6, “A Standard

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 45 3.8 Transformations Conformance

Compression Scheme for Unicode” on the CD-ROM or the up-to-date version on the Uni- code Web site. D29 A Unicode (or UCS) transformation format (UTF) transforms each Unicode scalar value into a unique sequence of code values. A UTF may specify a byte order for the serialization of the code values into bytes. A UTF may also specify the use of a byte order mark. • Code values are particular units of computer storage specified by the transforma- tion format—for example, 16-bit integers or bytes. In the latter case, a code value sequence can be referred to as a byte sequence. • Any sequence of code values that would correspond to a scalar value greater than 10FFFF16 is illegal. Because every Unicode coded character sequence maps to a unique sequence of code values in a given UTF, a reverse mapping can be derived. Thus every UTF supports lossless round- trip transcoding: mapping from any Unicode coded character sequence S to a sequence of code values and back will produce S again. To ensure that round-trip transcoding is possi- ble, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE16, FFFF16, and unpaired surrogates. D30 For a given UTF, a code value sequence that cannot be produced from any sequence of Unicode scalar values is called an ill-formed code value sequence. D31 For a given UTF, a code value sequence that cannot be mapped back to a sequence of Unicode scalar values is called an illegal code value sequence.

• For example, in UTF-8 every code value of the form 110xxxxx2 must be followed with a code value of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxx2 is illegal and must never be generated. When faced with this illegal code value sequence while transforming or interpreting, a UTF-8 conformant process must treat the first code value 110xxxxx2 as an illegal termination error—for example, by signaling an error, filtering the code value out, or representing the code value with a marker such as U+FFFD  . In the latter two cases, it will continue processing at the second code value 0xxxxxx2. D32 For a given UTF, an ill-formed code value sequence that is not illegal is called an irregular code value sequence. • To make implementations simpler and faster, some transformation formats may allow irregular code value sequences without requiring error handling. For exam- ple, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 con- formant process may map the code value sequence C0 80 (110000002 100000002) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence—it shall generate the sequence 00 (000000002) instead. A conformant process shall not use irregular code value sequences to encode out-of-band information. D33 UTF-16BE is the Unicode Transformation Format that serializes a Unicode value as a sequence of two bytes, in big-endian format. An initial sequence corresponding to U+FEFF is interpreted as a   - . • In UTF-16BE, <004D 0061 0072 006B> is serialized as <00 4D 00 61 00 72 00 6B>. D34 UTF-16LE is the Unicode Transformation Format that serializes a Unicode value as a sequence of two bytes, in little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a   - . • In UTF-16LE, <004D 0061 0072 006B> is serialized as <4D 00 61 00 72 00 6B 00>.

46 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.9 Special Character Properties

D35 UTF-16 is the Unicode Transformation Format that serializes a Unicode value as a sequence of two bytes, in either big-endian or little-endian format. An initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark: it is used to distinguish between the two byte orders. The byte order mark is not considered part of the content of the text. A serialization of Unicode values into UTF-16 may or may not begin with a byte order mark. • In UTF-16, <004D 0061 0072 006B> is serialized as , , or <00 4D 00 61 00 72 00 6B>. • The term UTF-16 can be used ambiguously. When referring to the encoding of Uni- code in memory, there is no associated byte orientation and a BOM is never used. When referring to a serialization of Unicode into bytes, it may have a BOM and can have either byte orientation. D36 UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3-1. • In UTF-8, <004D 0061 0072 006B> is serialized as <4D 61 72 6B>. Table 3-1 specifies the bit distribution from a Unicode character (or surrogate pair) into the one- to four-byte values of the corresponding UTF-8 sequence. Note that the four-byte form for surrogate pairs involves an addition of 1000016, to account for the starting offset to the encoded values referenced by surrogates. The definition of UTF-8 in Amendment 2 to ISO/IEC 10646 also allows for the use of five- and six-byte sequences to encode charac- ters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters. Table 3-1. UTF-8 Bit Distribution

Scalar Value UTF-16 1st Byte 2nd Byte 3rd Byte 4th Byte 000000000xxxxxxx 000000000xxxxxxx 0xxxxxxx 00000yyyyyxxxxxx 00000yyyyyxxxxxx 110yyyyy 10xxxxxx zzzzyyyyyyxxxxxx zzzzyyyyyyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx uuuuuzzzzyyyyyyxxxxxx 110110wwwwzzzzyy+ 11110uuua 10uuzzzz 10yyyyyy 10xxxxxx 110111yyyyxxxxxx

a. Where uuuuu = wwww + 1 (to account for addition of 1000016 as in Section 3.7, Sur- rogates).

When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used. This practice preserves uniqueness of encoding. For example, the Unicode binary value <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>. The latter is an example of an irregular UTF-8 byte sequence. Irreg- ular UTF-8 sequences shall not be used for encoding any other information. When converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm.

3.9 Special Character Properties The behavior of most characters does not require special attention in this standard. How- ever, the following characters exhibit special behavior, as described in Chapter 13, Special Areas and Format Characters, and in Chapter 4, Character Properties.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 47 3.9 Special Character Properties Conformance

• Line boundary control 0009 HORIZONTAL TAB 000A LINE FEED 000C FORM FEED 000D CARRIAGE RETURN 0020 SPACE 00A0 NO-BREAK SPACE 0F0B TIBETAN MARK INTERSYLLABIC TSHEG 0F0C TIBETAN MARK DELIMITER TSHEG BSTAR 2000 QUAD 2002 EN SPACE 2003 SPACE 2004 THREE-PER-EM SPACE 2005 FOUR-PER-EM SPACE 2006 SIX-PER-EM SPACE 2007 2008 PUNCTUATION SPACE 2009 THIN SPACE 200A HAIR SPACE 200B ZERO WIDTH SPACE 2011 NON-BREAKING HYPHEN 2028 LINE SEPARATOR 2029 PARAGRAPH SEPARATOR 202F NARROW NO-BREAK SPACE FEFF ZERO WIDTH NO-BREAK SPACE • Hyphenation control 002D HYPHEN-MINUS 00AD 058A ARMENIAN HYPHEN 1806 MONGOLIAN TODO SOFT HYPHEN 2010 HYPHEN 2011 NON-BREAKING HYPHEN 2027 HYPHENATION POINT • Fraction formatting 2044 FRACTION • Special behavior with nonspacing marks 0020 SPACE 0069 LATIN SMALL LETTER I 006A LATIN SMALL LETTER J 00A0 NO-BREAK SPACE 0131 LATIN SMALL LETTER DOTLESS I • Double nonspacing marks 0360 COMBINING DOUBLE TILDE 0361 COMBINING DOUBLE INVERTED 0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW • Joining 200C ZERO WIDTH NON-JOINER 200D ZERO WIDTH JOINER • Bidirectional ordering 200E LEFT-TO-RIGHT MARK 200F RIGHT-TO-LEFT MARK 202A LEFT-TO-RIGHT EMBEDDING 202B RIGHT-TO-LEFT EMBEDDING 202C POP DIRECTIONAL FORMATTING 202D LEFT-TO-RIGHT OVERRIDE

48 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.9 Special Character Properties

202E RIGHT-TO-LEFT OVERRIDE • Alternate formatting 206A INHIBIT SYMMETRIC SWAPPING 206B ACTIVATE SYMMETRIC SWAPPING 206C INHIBIT ARABIC FORM SHAPING 206D ACTIVATE ARABIC FORM SHAPING 206E NATIONAL DIGIT SHAPES 206F NOMINAL DIGIT SHAPES • Syriac abbreviation 070F SYRIAC ABBREVIATION MARK • Indic dead-character formation 094D DEVANAGARI SIGN 09CD BENGALI SIGN VIRAMA 0A4D GURMUKHI SIGN VIRAMA 0ACD GUJARATI SIGN VIRAMA 0B4D ORIYA SIGN VIRAMA 0BCD TAMIL SIGN VIRAMA 0C4D TELUGU SIGN VIRAMA 0CCD KANNADA SIGN VIRAMA 0D4D MALAYALAM SIGN VIRAMA 0DCA SINHALA SIGN AL-LAKUNA 0F84 TIBETAN SIGN HALANTA 1039 MYANMAR SIGN VIRAMA 17D2 KHMER SIGN COENG • Mongolian variant selectors 180B MONGOLIAN FREE VARIATION SELECTOR ONE 180C MONGOLIAN FREE VARIATION SELECTOR TWO 180D MONGOLIAN FREE VARIATION SELECTOR THREE 180E MONGOLIAN VOWEL SEPARATOR • Ideographic variation indication 303E IDEOGRAPHIC VARIATION INDICATOR • Ideographic description 2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT 2FF1 IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW 2FF2 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT 2FF3 IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW 2FF4 IDEOGRAPHIC DESCRIPTION CHARACTER FULL SUR- ROUND 2FF5 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE 2FF6 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW 2FF7 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT 2FF8 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT 2FF9 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT 2FFA IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT 2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 49 3.10 Canonical Ordering Behavior Conformance

• Interlinear annotation FFF9 INTERLINEAR ANNOTATION ANCHOR FFFA INTERLINEAR ANNOTATION SEPARATOR FFFB INTERLINEAR ANNOTATION TERMINATOR • Object replacement FFFC OBJECT REPLACEMENT CHARACTER • Code conversion fallback FFFD REPLACEMENT CHARACTER • Byte order signature FEFF ZERO WIDTH NO-BREAK SPACE

3.10 Canonical Ordering Behavior The purpose of this section is to provide unambiguous interpretation of a combining char- acter sequence. In the Unicode Standard, the order of characters in a combining character sequence is interpreted according to the following principles: • In the Unicode Standard, all combining characters are encoded following the base characters to which they apply. Thus the Unicode sequence U+0061 “a”     + U+0308 “þ”í   + U+0075 “u”     is unambiguously interpreted (and displayed) as “äu”, not “aü”. • Enclosing nonspacing marks surround all previous characters up to and including the base character (see Figure 3-1). They thus successively surround previous enclosing nonspacing marks. Figure 3-1. Enclosing Marks

a + @ + @¨ + @ ➠ a¨

• Double diacritics always bind more loosely than other nonspacing marks. When rendering, the double diacritic will float above other diacritics, excluding enclosing diacritics (see Figure 3-2). Figure 3-2. Positioning of Double Diacritics

~ ~ o + @ˆ + @ + o + @¨ ➠ ôö ~ ~ o + @ + @ˆ + o + @¨ ➠ ôö

• Combining marks with the same combining class are generally positioned graphi- cally outward from the base character they modify. Some specific nonspacing marks override the default stacking behavior by being positioned side-by-side rather than stacking or by ligaturing with an adjacent nonspacing mark. When positioned side- by-side, the order of codes is reflected by positioning in the dominant order of the script with which they are used.

50 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.10 Canonical Ordering Behavior

• If combining characters have different combining classes—for example, when one nonspacing mark is above a base character form and another is below it—then no distinction of graphic form or semantic will result. The following subsections formalize these principles in terms of a normative list of com- bining classes and an algorithmic statement of how to use those combining classes to unambiguously interpret a combining character sequence.

Combining Classes The Unicode Standard treats sequences of nonspacing marks as equivalent if they do not typographically interact. The canonical ordering algorithm defines a method for determin- ing which sequences interact and gives a canonical ordering of these sequences for use in equivalence comparisons. D37 Combining class: a numeric value given to each combining Unicode character that determines with which other combining characters it typographically interacts. • See Section 4.2, Combining Classes—Normative, for a list of the combining classes for Unicode characters. Characters have the same class if they interact typographically, and different classes if they do not. • Enclosing characters and spacing combining characters have the class of their base characters. • The particular numeric value of the combining class does not have any special sig- nificance; the intent of providing the numeric values is only to distinguish the com- bining classes as being different, for use in equivalence comparisons.

Canonical Ordering The canonical ordering of a decomposed character sequence results from a sorting process that acts on each sequence of combining characters according to their combining class. Characters with combining class zero never sort relative to other characters, so the amount of work in the algorithm depends on the number of non-class-zero characters in a row. An implementation of this algorithm will be extremely fast for typical text. The algorithm described here represents a logical description of the process. Optimized algorithms can be used in implementations as long as they are equivalent—that is, as long as they produce the same result. More explicitly, the canonical ordering of a decomposed character sequence D results from the following algorithm. R1 For each character x in D, let p(x) be the combining class of x. R2 Whenever any pair (A, B) of adjacent characters in D is such that p(B) Å 0 & p(A) > p(B), exchange those characters. R3 Repeat step R2 until no exchanges can be made among any of the characters in D.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 51 3.11 Conjoining Jamo Behavior Conformance

Examples of this ordering appear in Table 3-2. Table 3-2. Sample Combining Classes

Combining Abbreviation Code Unicode Name Class 0a 0061     220 underdot 0323    230 diaeresis 0308   230 breve 0306   0a-underdot1EA1        0 a-diaeresis 00E4       0a-breve0103      

a + underdot + diaeresis Æ a + underdot + diaeresis a + diaeresis + underdot Æ a + underdot + diaeresis Because underdot has a lower combining class than diaeresis, the algorithm will return the a, then the underdot, then the diaeresis. However, because diaeresis and breve have the same combining class (because they interact typographically), they do not rearrange. a + breve + diaeresis Ç a + diaeresis + breve a + diaeresis + breve Ç a + breve + diaeresis Applying the algorithm gives the results shown in ble 3-3. Table 3-3. Canonical Ordering Results

Original Decompose Sort Result a-diaeresis + underdot a + diaeresis + underdot a + underdot + diaeresis a + underdot + diaeresis a + diaeresis + underdot a + underdot + diaeresis a + underdot + diaeresis a + underdot + diaeresis a + underdot + diaeresis a-underdot + diaeresis a + underdot + diaeresis a + underdot + diaeresis a-diaeresis + breve a + diaeresis + breve a + diaeresis + breve a + diaeresis + breve a + diaeresis + breve a + breve + diaeresis a + breve + diaeresis a-breve + diaeresis a + breve + diaeresis a + breve + diaeresis

Use with Collation When collation processes do not require correct sorting outside of a given domain, they are not required to invoke the canonical ordering algorithm for excluded characters. For exam- ple, a Greek collation process may not need to sort Cyrillic letters properly; in that case, it does not have to maximally decompose and reorder Cyrillic letters and may just choose to sort them according to Unicode order. For the complete treatment of collation, see Unicode Technical Report #10, “Unicode Collation Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site.

3.11 Conjoining Jamo Behavior The Unicode Standard contains both a large set of precomposed modern Hangul syllables and a set of conjoining Hangul jamo, which can be used to encode archaic syllable blocks as well as modern syllable blocks. This section describes how to

52 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.11 Conjoining Jamo Behavior

• Determine the syllable boundaries in a sequence of conjoining jamo characters • Compose jamo characters into Hangul syllables • Determine the canonical decomposition of Hangul syllables • Algorithmically determine the names of the Hangul syllable characters (For more information, see the “Hangul Syllables” and “Hangul Jamo” subsections in Section 10.4, Hangul.) The jamo characters can be classified into three sets of characters: choseong (leading conso- nants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). In the following discussion, these jamo are abbreviated as L (leading consonant), V (vowel), and T (trailing conso- nant); syllable breaks are shown by middle dots “·”; and non-jamo are shown by X.

Syllable Boundaries In rendering, a sequence of jamos is displayed as a series of syllable blocks. The following rules specify how to divide up an arbitrary sequence of jamos (including nonstandard sequences) into these syllable blocks. In these rules, a choseong filler (L ) is treated as a cho- f seong character, and a jungseong filler (V ) is treated as a jungseong. f Within any sequence of characters, a syllable break occurs between the pairs of characters shown in Table 3-4. All other sequences of Hangul jamos are considered to be part of the same syllable. Note that like other non-jamo characters, any combining mark between two conjoining jamos prevents the jamos from forming a syllable. Table 3-4. Hangul Syllable Break Rules

Condition Example Any conjoining jamo and any non-jamo L·X, V·X, T·X, X·L, X·V, X·T A jongseong (trailing) and choseong (leading) T·L A jungseong (vowel) and a choseong (leading) V·L A jongseong (trailing) and jungseong (vowel) T·V

Standard Syllables A standard syllable block is composed of a sequence of choseong followed by a sequence of jungseong and optionally a sequence of jongseong (for example, S = LV or LVT ). A sequence of nonstandard syllable blocks can be transformed into a sequence of standard syllable blocks by inserting choseong fillers and jungseong fillers. Examples. In Table 3-5, row (1) shows syllable breaks in a standard sequence, row (2) shows syllable breaks in a nonstandard sequence, and row (3) shows how the sequence in (2) could be transformed into standard form by inserting fillers into each syllable. Table 3-5. Syllable Break Examples

No. Sequence Sequence with Syllable Breaks Marked 1 LVTLVLVLV L VL V T → LVT · LV · LV · LV · L V · L V T f f f f f f f f 2 LLTVLTLTVVLL → LLT · V · LT · LT · VV · LL 3 LLTVLTLTVVLL → LLV T · L V · LV T · LV T · L VV · LLV f f f f f f

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 53 3.11 Conjoining Jamo Behavior Conformance

Hangul Syllable Composition The following algorithm describes how to take a sequence of canonically decomposed char- acters D and compose Hangul syllables. Hangul composition and decomposition are sum- marized here, but for a more complete description, implementers must consult Unicode Technical Report #15, “Unicode Normalization Forms,” on the CD-ROM or the up-to-date version on the Unicode Web site. Note that, like other non-jamo characters, any combining mark between two conjoining jamos prevents the jamos from composing. First, define the following constants: SBase = AC0016 LBase = 110016 VBase = 116116 TBase = 11A716 SCount = 11172 LCount = 19 VCount = 21 TCount = 28 NCount = VCount * TCount 1. Process D by composing the conjoining jamo wherever possible, according to the compatibility decomposition rules in Chapter 14, Code Charts. (Typical interchange of conjoining jamo will be in precomposed forms. In such cases, this step may not be necessary. Raw keyboard data, on the other hand, may be in the form of a compatibility decomposition.) 2. Let i represent the current position in the sequence D. Compute the following indices, which represent the ordinal number (zero-based) for each of the com- ponents of a syllable, and the index j, which represents the index of the last character in the syllable. LIndex = D[i] - LBase VIndex = D[i+1] - VBase TIndex = D[i+2] - TBase j = i + 2 3. If either of the first two characters is out of bounds (LIndex < 0 OR LIndex >= LCount OR VIndex < 0 OR VIndex >= VCount), then increment i, return to step 2, and continue from there. 4. If the third character is out of bounds (TIndex <= 0 or TIndex >= TCount), then it is not part of the syllable. Reset the following: TIndex = 0 j = i + 1 5. Replace the characters D[i] through D[j] by the Hangul syllable S, and set i to be j+1. S = (LIndex * VCount + VIndex) * TCount + TIndex + SBase. Example. The first three characters are U+1111 ñ    U+1171 Ö    U+11B6 ›   -

Compute the following indices, LIndex = 17 VIndex = 16 TIndex = 15

54 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior and replace the three characters by S = [(17 * 21) + 16] * 28 + 15 + SBase = D4DB16 = /

Hangul Syllable Decomposition The following describes the reverse mapping—how to take Hangul syllable S and derive the canonical decomposition D. This normative mapping for these characters is equivalent to the canonical mapping in the character charts for other characters. 1. Compute the index of the syllable: SIndex = S - SBase 2. If SIndex is in the range (0 <= SIndex < SCount), then compute the compo- nents as follows: L = LBase + SIndex / NCount V = VBase + (SIndex % NCount) / TCount T = TBase + SIndex % TCount The operators “/” and “%” are as defined in Section 0.2, Notational Conventions. 3. If T = TBase, then there is no trailing character, so replace S by the sequence . Otherwise, there is a trailing character, so replace S by the sequence . Example L = LBase + 17 V = VBase + 16 T = TBase + 15 D4DB16 → 111116, 117116, 11B616

Hangul Syllable Names The character names for Hangul syllables are derived from the decomposition by starting with the string  , and appending the short name of each decomposition component in order. (See Section 4.4, Jamo Short Names—Normative.) For example, for U+D4DB, derive the decomposition, as shown in the preceding example. It produces the following three-character sequence: U+1111    U+1171    U+11B6   - The character name for U+D4DB is then generated as   . This character name is a normative property of the character.

3.12 Bidirectional Behavior The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, in several scripts (such as Arabic or Hebrew), the natural ordering of horizontal text in display is from right to left. If all of the text has the same horizontal direction, then the ordering of the display text is unambiguous. However, when bidirectional text (a mix- ture of left-to-right and right-to-left horizontal text) is present, some ambiguities can arise in determining the ordering of the displayed characters.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 55 3.12 Bidirectional Behavior Conformance

This section describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit format codes for special circumstances. In most cases, there is no need to include additional information with the text to obtain cor- rect display ordering. In the case of bidirectional text, there are circumstances where an implicit bidirectional ordering will not suffice to produce comprehensible text. To deal with these cases, a mini- mal set of directional formatting codes is defined to control the ordering of characters when rendered. This allows for exact control of the display ordering for legible interchange and also ensures that plain text used for simple items like file names or labels can always be correctly ordered for display. The directional formatting codes are used only to influence the display ordering of text. In all other respects, they should be ignored—they have no effect on the comparison of text, word breaks, parsing, or numeric analysis. When working with bidirectional text, the characters are still interpreted in logical order— only the display is affected. The display ordering of bidirectional text depends upon the directional properties of the characters in the text.

Directional Formatting Codes Two types of explicit codes are used to modify the standard implicit Unicode bidirectional algorithm. In addition, there are implicit ordering codes, the right-to-left and left-to-right marks. All of these codes are limited to the current paragraph; thus their effects are termi- nated by a paragraph separator. The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters. Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside the embedding, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm. Explicit Directional Embedding. The following codes signal that a piece of text is to be treated as embedded. For example, an English quotation in the middle of an Arabic sen- tence could be marked as being embedded left-to-right text. If a Hebrew phrase occurred in the middle of the English quotation, then that phrase could be marked as being embedded right-to-left. These codes allow for nested embeddings. RLE Right-to-Left Embedding Treat the following text as embedded right- to-left. LRE Left-to-Right Embedding Treat the following text as embedded left-to- right. The precise meaning of these codes will be made clear in the discussion of the algorithm. The effect of right-left line direction, for example, can be accomplished by simply embed- ding the text with RLE...PDF. Explicit Directional Overrides. The following codes allow the bidirectional character types to be overridden when required for special cases, such as for part numbers. These codes allow for nested directional overrides.

56 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior

RLO Right-to-Left Override Force following characters to be treated as strong right-to-left characters. LRO Left-to-Right Override Force following characters to be treated as strong left-to-right characters. The precise meanings of these codes will be made clear in the discussion of the algorithm. The right-to-left override, for example, can be used to force a part number made of mixed English, digits, and Hebrew letters to be written from right to left. Terminating Explicit Directional Code. The following code terminates the effects of the last explicit code (either embedding or override) and restores the bidirectional state to what it was before that code was encountered. PDF Pop Directional Format Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO. Implicit Directional Marks. These characters are very lightweight codes. They act exactly like right-to-left or left-to-right characters, except that they are not displayed and do not have any other semantic effect. Their use is generally more convenient than the explicit embeddings or overrides because their scope is much more local. RLM Right-to-Left Mark Right-to-left zero-width character LRM Left-to-Right Mark Left-to-right zero-width character There is no special mention of the implicit directional marks in the following algorithm. That omission occurs because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.

Basic Display Algorithm The bidirectional algorithm takes a stream of text as input and proceeds in three main phases: • Separation of the input text into paragraphs. The rest of the algorithm affects only the text between paragraph separators. • Resolution of the embedding levels of the text. In this phase, the directional charac- ter types, plus the explicit format codes, are used to produce resolved embedding levels. • Reordering the text for display on a line-by-line basis using the resolved embedding levels, once the text has been broken into lines. The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by U+2029 -   or the appropriate Newline Function (see Section 4.3, Directionality— Normative and Unicode Technical Report #13, “Unicode Newline Guidelines,” found on the CD-ROM or the up-to-date version on the Unicode Web site for information on the handling of CR, LF, and CRLF). Paragraphs may also be determined by higher-level proto- cols; for example, the text in two different cells of a table will be in different paragraphs. Combining characters are always attached to the preceding base character in the memory representation. Even after reordering for display and performing character shaping, the glyph representing a combining character will be attached to the glyph representing its base character in memory. Depending on the line orientation and the placement direction of base glyphs, it may, for example, become attached to the glyph on the left, or on the right, or above.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 57 3.12 Bidirectional Behavior Conformance

In the following subsections, the normative definitions and rules are distinguished by the numbering given in Table 3-6. Table 3-6. Normative Definitions and Rules

Numbering Section BDn Definitions Pn Paragraph levels Xn Explicit levels and directions Wn Weak types Nn Neutral types In Implicit levels Ln Resolved levels

Definitions BD1 The bidirectional character types are values assigned to each Unicode character, including unassigned characters. BD2 Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61. • Embedding levels are explicitly set by both override format codes and embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guar- antee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings. BD3 The default direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd. • For example, in a particular piece of text, level 0 is plain English text, level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embed- ding level will become clear when the reordering algorithm is discussed, but this section provides an example of how the algorithm works. BD4 The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph. BD5 The direction of the paragraph embedding level is called the paragraph direction. • In some contexts, the paragraph direction is also known as the base direction. BD6 The directional override status determines whether the bidirectional type of charac- ters is reset with explicit directional controls. This status has three states, shown in Table 3-7. BD7 A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level.

58 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior

Table 3-7. Directional Override Status

Status Interpretation Neutral No override is currently active Right-to-left Characters are to be reset to R Left-to-right Characters are to be reset to L

Example. In this and the following examples, case is used to indicate different implicit character types for those unfamiliar with right-to-left letters. Uppercase letters stand for right-to-left characters (such as Arabic or Hebrew), whereas lowercase letters stand for left- to-right characters (such as English or Russian). Memory: car is THE CAR in arabic Character types: LLL-LL-RRR-RRR-LL-LLLLLL Resolved levels: 000000011111110000000000 Notice that the neutral character (space) between THE and CAR gets the level of the sur- rounding characters. In this way, the implicit directional marks have an effect. By inserting appropriate directional marks around neutral characters, the level of the neutral characters can be changed. Bidirectional Character Types. The normative bidirectional character types for each char- acter are specified in the Unicode Character Database and summarized in Table 3-8. Table 3-8. Bidirectional Character Types

Category Type Description Scope L Left-to-Right LRM, most alphabetic, syllabic, Han ideo- graphic characters, digits that are neither European nor Arabic, all unassigned charac- ters except in the ranges (0590–05FF, FB1D– FB4F) and (0600–07BF, FB50–FDFF, FE70– FEFF) LRE Left-to-Right Embedding LRE LRO Left-to-Right Override LRO Strong R Right-to-Left RLM, , most punctuation specific to that script, all unassigned charac- ters in the ranges (0590–05FF, FB1D–FB4F) AL Right-to-Left Arabic Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, all unassigned characters in the ranges (0600– 07BF, FB50–FDFF, FE70–FEFF) RLE Right-to-Left Embedding RLE RLO Right-to-Left Override RLO

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 59 3.12 Bidirectional Behavior Conformance

Table 3-8. Bidirectional Character Types (Continued)

Category Type Description Scope PDF Pop Directional Format PDF EN European Number European digits, Eastern Arabic-Indic digits, ... European Number Separator Solidus (slash) ET European Number Terminator Plus sign, minus sign, degree, currency sym- bols, ... AN Arabic Number Arabic-Indic digits, Arabic decimal and thou- Weak sands separators, ... Common Number Separator , comma, (period), - , ... NSM Nonspacing Mark Characters marked Mn (nonspacing mark) and Me (enclosing mark) in the Unicode Character Database BN Boundary Neutral Formatting and control characters, other than those explicitly given types above (to be ignored in processing bidirectional text) B Paragraph Separator Paragraph separator, appropriate newline functions, higher-protocol paragraph deter- mination S Segment Separator Tab Neutral WS Whitespace Space, figure space, line separator, form feed, general punctuation spaces, ... ON Other Neutrals All other characters, including   

• The term European digits is used to refer to decimal forms common in Europe and elsewhere, and the term Arabic-Indic digits refers to the native Arabic forms. (See Section 8.2, Arabic, for more details on naming digits.) • Unassigned characters are given strong types in the algorithm. This convention is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirec- tional types may change. • Private-use characters can be assigned different values by a conformant implemen- tation. • For the purpose of the bidirectional algorithm, inline objects (such as graphics) are treated as if they are an    (U+FFFC). Table 3-9 lists additional abbreviations used in the examples and internal character types used in the algorithm. Table 3-9. Abbreviations for Examples and Internal Types

Symbol Description N Neutral or separator (B, S, WS, ON). e The text ordering type (L or R) that matches the embedding level direction (even or odd) for the current run. Cf. X10 below. sor The text ordering type (L or R) assigned to the position before a level run. eor The text ordering type (L or R) assigned to the position after a level run.

60 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior

Resolving Embedding Levels The body of the bidirectional algorithm uses character types and explicit codes to produce a list of resolved levels. This resolution process consists of five steps: (1) determining the paragraph level; (2) determining explicit embedding levels and directions; (3) resolving weak types; (4) resolving neutral types; and (5) resolving implicit embedding levels. Paragraph Level. The bidirectional algorithm applies to paragraphs. P1. Split the text into separate paragraphs. A paragraph separator is kept with the previous paragraph. Within each paragraph, apply all other rules of this algorithm. P2. In each paragraph, find the first character that is a strong directional type (L, AL, R). Because paragraph separators delimit text in this algorithm, they will generally be the first strong character after a paragraph separator or at the very beginning of the text. P3. If a character is found in rule P2 and it is of type AL or R, then set the paragraph embedding level to 1; otherwise, set it to zero. Note that when a higher-level protocol specifies the paragraph level, it is not necessary to apply rules P2 and P3. Explicit Levels and Directions. All explicit embedding levels are determined from the embedding and override codes, by applying the explicit level rules X1 through X9. These rules are applied as part of the same logical pass over the input. Explicit Embeddings. An explicit embedding code sets the level of the text, but does not change the directional character type of affected characters. X1. Begin by setting the current embedding level to the paragraph embedding level. Set the directional override status to neutral. Process each character iteratively, applying rules X2 through X9. Only embedding levels from 0 to 61 are valid in this phase. In the resolution of levels in rules I1 and I2, the maximum embedding level of 62 can be reached. X2. With each RLE, compute the least greater odd embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. For example, level 0 ò 1; levels 1, 2 ò 3; levels 3, 4 ò 5; ...59, 60 ò 61; above 60, no change (don’t change levels with RLE if the new level would be invalid). X3. With each LRE, compute the least greater even embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. For example, levels 0, 1 ò 2; levels 2, 3 ò 4; levels 4, 5 ò 6; ...58, 59 ò 60; above 59, no change (don’t change levels with LRE if the new level would be invalid). Explicit Overrides. An explicit directional override sets the embedding level in the same way that the explicit embedding codes do, but also changes the directional character type of affected characters to the override direction.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 61 3.12 Bidirectional Behavior Conformance

X4. With each RLO, compute the least greater odd embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to right-to-left. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. X5. With each LRO, compute the least greater even embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to left-to-right. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. X6. For all types besides RLE, LRE, RLO, LRO, and PDF: a. Set the level of the current character to the current embedding level. b. Whenever the directional override status is not neutral, reset the current character type to the directional override status. If the directional override status is neutral, then characters retain their normal types: Ara- bic characters stay AL, Latin characters stay L, neutrals stay N, and so on. If the directional override status is R, then characters become R. If the directional override status is L, then characters become L. Terminating Embeddings and Overrides. A single code terminates the scope of the current explicit code, whether an embedding or a directional override. All codes and pushed states are completely popped at the end of paragraphs. X7. With each PDF, determine the matching embedding or override code. If a valid match- ing code was found, restore (pop) the last remembered (pushed) embedding level and directional override. X8. All explicit directional embeddings and overrides are completely terminated at the end of each paragraph. Paragraph separators are not included in the embedding. X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN codes. • Note that an implementation does not have to actually remove the codes, it just has to behave as though the codes were not present for the remainder of the algorithm. Conformance does not require any particular placement of these codes as long as all other characters are ordered correctly. See “Implementation Notes” later in this section for information on implementing the algorithm without removing the formatting codes. X10. The remaining rules are applied to each run of characters at the same level. For each run, determine the start-of-level-run (sor) and end-of-level-run (eor) type, either L or R. This determination depends on the higher of the two levels on either side of the boundary (at the start or end of the paragraph, the level of the “other” run is the base embedding level). If the higher level is odd, the type is R; otherwise, it is L. For example: Levels: 000111 2 Runs: Ä 1 ÅÄ2 Å <3>

62 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior

Run 1 is at level 0, sor is L, eor is R. Run 2 is at level 1, sor is R, eor is L. Run 3 is at level 2, sor is L, eor is L. For two adjacent runs, the eor of the first run is the same as the sor of the second. Resolving Weak Types. Weak types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used. Nonspacing marks are now resolved based on the previous characters. W1. Examine each nonspacing mark (NSM) in the level run, and change the type of the NSM to the type of the previous character. If the NSM is at the start of the level run, it will get the type of sor. Assume in this example that sor is R: AL NSM NSM ò AL AL AL sor NSM ò sor R The text is next parsed for numbers. This pass will change the directional types European Number Separator, European Number Terminator, and Common Number Separator to be European Number text, Arabic Number text, or Other Neutral text. The text to be scanned may have already had its type altered by directional overrides. If so, then it will not parse as numeric. W2. Search backward from each instance of a European number until the first strong type (R, L, AL, or sor) is found. If an AL is found, change the type of the European number to Arabic number. AL EN ò AL AN AL N EN ò AL N AN sor N EN ò sor N EN L N EN ò L N EN R N EN ò R N EN W3. Change all ALs to R. W4. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type: EN ES EN ò EN EN EN EN CS EN ò EN EN EN AN CS AN ò AN AN AN W5. A sequence of European terminators adjacent to European numbers changes to all Euro- pean numbers: ET ET EN ò EN EN EN EN ET ET ò EN EN EN AN ET EN ò AN EN EN W6. Otherwise, separators and terminators change to Other Neutral: AN ET ò AN ON L ES EN ò L ON EN EN CS AN ò EN ON AN ET AN ò ON AN

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 63 3.12 Bidirectional Behavior Conformance

W7. Search backward from each instance of a European number until the first strong type (R, L, or sor) is found. If an L is found, then change the type of the European number to L. L N EN ò L N L R N EN ò R N EN Resolving Neutral Types. Neutral types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used. The next phase resolves the direction of the neutrals. The results of this phase are that all neutrals become either R or L. Generally, neutrals take on the direction of the surrounding text. In case of a conflict, they take on the embedding direction. N1. A sequence of neutrals takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers are treated as though they were R. Start-of-level-run (sor) and end-of-level-run (eor) are used at level run boundaries. R N R ò R R R L N L ò L L L R N AN ò R R AN AN N R ò AN R R R N EN ò R R EN EN N R ò EN R R EN N AN ò EN R AN • Any European numbers after an L have already been turned into L by this time, so only the European numbers after R are treated as R. N2. Any remaining neutrals take the embedding direction. N ò e Assume in this example that eor is L, and sor is R: L N eor ò L L eor R N eor ò R e eor sor N L ò sor e L sor N R ò sor R R Examples. A list of numbers separated by neutrals and embedded in a directional run will come out in the run’s order. Storage: he said “THE VALUES ARE 123, 456, 789, OK”. Display: he said “ ,789 ,456 ,123 ERA SEULAV EHT”. In this case, both the comma and the space between the numbers take on the direction of the surrounding text (uppercase = right-to-left), ignoring the numbers. The are not considered part of the number because they are not surrounded on both sides. If an adjacent left-to-right sequence is present, then European numbers will adopt that direc- tion. Storage: he said “IT IS A bmw 500, OK.” Display: he said “.KO ,bmw 500 A SI TI” Resolving Implicit Levels. In the final phase, the embedding level of text may be increased, based upon the resolved character type. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level. In addition, numeric text will always end up with a higher level than the paragraph level. (Note that it is

64 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior possible for text to end up at levels higher than 61 as a result of this process.) This situation results in the following rules: I1. For all characters with an even (left-to-right) embedding direction, those of type R go up one level and those of type AN or EN go up two levels. I2. For all characters with an odd (right-to-left) embedding direction, those of type L, EN, or AN go up one level. Table 3-10 summarizes the results of the implicit algorithm. Table 3-10. Resolving Implicit Levels

Embedding Level Type Even Odd LELEL+1 REL+1EL AN +2 EL+1 EN EL+2 EL+1

Reordering Resolved Levels The following algorithm describes the logical process of finding the correct display order. As noted earlier, this logical process is not necessarily the actual implementation, which may diverge for efficiency as long as it produces the same results. As opposed to resolution phases, this algorithm acts on a per-line basis, and is applied after any line wrapping is applied to the paragraph. The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see Section 8.2, Arabic). Logically, there are the following steps: • The levels of the text are determined according to the bidirectional algorithm. • The characters are shaped into glyphs according to their context (taking the embed- ding levels into account for mirroring!). • The accumulated widths of those glyphs (in logical order) are used to determine line breaks. • For each line, rules L1–L4 are used to reorder the characters on that line. • The glyphs corresponding to the characters on the line are displayed in that order. L1. On each line, reset the embedding level of the following characters to the paragraph embedding level: 1. segment separators, 2. paragraph separators, 3. any sequence of whitespace characters preceding a segment separator or paragraph separator, and 4. any sequence of whitespace characters at the end of the line. • The types of characters used here are the original types, not those modified by the previous phase.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 65 3.12 Bidirectional Behavior Conformance

• Because a paragraph separator breaks lines, there will be at most one per line, at the end of that line. In combination with the following rule, this requirement means that trailing whitespace will appear at the visual end of the line (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph. L2. From the highest level found in the text to the lowest odd level on each line, reverse any contiguous sequence of characters that is at that level or higher. This process reverses a progressively larger series of substrings. The following four exam- ples illustrate this rule. Memory: car means CAR. Resolved levels: 00000000001110 Reverse level 1: car means RAC. Memory: car MEANS CAR. Resolved levels: 22211111111111 Reverse level 2: rac MEANS CAR. Reverse levels 1, 2: .RAC SNAEM car Memory: he said “car MEANS CAR.” Resolved levels: 000000000222111111111100 Reverse level 2: he said “rac MEANS CAR.” Reverse levels 1, 2: he said “RAC SNAEM car.” Memory: DID YOU SAY ‘he said “car MEANS CAR”’? Resolved levels: 11111111111112222222224443333333333211 Reverse level 4: DID YOU SAY ‘he said “rac MEANS CAR”’? Reverse levels 3, 4: DID YOU SAY ‘he said “RAC SNAEM car”’? Reverse levels 2–4: DID YOU SAY ’”rac MEANS CAR“ dias eh‘? Reverse levels 1–4: ?‘he said “RAC SNAEM car”’ YAS UOY DID L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed. Many font designers provide default metrics for combining marks that support rendering by simple overhang. Because of the reordering for right-to-left characters, it is common practice to make the glyphs for most combining characters overhang to the left (thereby assuming the characters will be applied to left-to-right base characters) and make the glyphs for combining characters in right-to-left scripts overhang to the right (thereby assuming that the characters will be applied to right-to-left base characters). With such fonts, the display ordering of the marks and base glyphs may need to be adjusted when combining marks are applied to “unmatching” base characters. See Section 5.14, Rendering Nonspacing Marks, for more information. L4. A character that possesses the mirrored property as specified by Section 4.7, Mirrored— Normative, must be depicted by a mirrored glyph if the resolved directionality of that character is R. For example, U+0028  —which is interpreted in the Unicode Standard as an opening parenthesis—appears as “(” when its resolved level is even, and as the mirrored glyph “)” when its resolved level is odd.

66 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior

Bidirectional Conformance The bidirectional algorithm specifies part of the intrinsic semantics of right-to-left charac- ters. In the absence of a higher-level protocol that specifically supersedes the interpretation of directionality, systems that interpret these characters must achieve results identical to the implicit bidirectional algorithm when rendering. Boundary Neutrals. The goal in marking a format or control character as BN is to ensure that it has no effect on the rest of the algorithm. Because the precise ordering of format characters with respect to others is not required for conformance, implementations are free to handle them in different ways for efficiency as long as the ordering of the other charac- ters is preserved. Explicit Formatting Codes. As with any Unicode characters, systems do not have to sup- port any particular explicit directional formatting code (although it is not generally useful to include a terminating code without including the initiator). Generally, conforming sys- tems will fall into three classes: • No bidirectional formatting. This choice implies that the system does not visually interpret characters from right-to-left scripts. • Implicit bidirectionality. The implicit bidirectional algorithm and the directional marks RLM and LRM are supported. • Full bidirectionality. The implicit bidirectional algorithm, the implicit directional marks, and the explicit directional embedding codes are supported: RLM, LRM, LRE, RLE, LRO, RLO, PDF. Higher-Level Protocols. The following are permissible ways for systems to apply higher- level protocols to the ordering of bidirectional text: • Override the paragraph embedding level. A higher-level protocol should provide for overriding the paragraph embedding level, such as on a table cell, paragraph, docu- ment, or system level. • Override the number handling to use information provided by a broader context. For example, information from other paragraphs in a document could be used to con- clude that the document was fundamentally Arabic and that EN should generally be converted to AN. • Replace, supplement, or override the directional overrides or embedding codes. This task is accomplished by providing information via additional stylesheet or markup information about the embedding level or character direction. The interpretation of such information must always be defined by reference to the behavior of the equiva- lent explicit codes as given in the algorithm. • Override the bidirectional character types assigned to control codes to match the inter- pretation of the control codes within the protocol. (See also Section 13.1, Control Codes.) • Remap the number shapes to match those of another set. For example, remap the Ara- bic number shapes to have the same appearance as the European numbers. When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted to ensure that the order matches that of the higher-level protocol or that (as in the last example) the appropriate characters can be substituted.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 67 3.12 Bidirectional Behavior Conformance

Implementation Notes Reference Implementations. Source code for two reference implementations of the bidi- rectional algorithm is provided as part of Unicode Technical Report #9, “The Bidirectional Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site. Imple- menters are encouraged to use this resource to test their implementations. Retaining Format Codes. Some implementations may wish to retain the format codes when running the algorithm. This goal may be accomplished in the following manner: • In rule X9, instead of removing the format codes, set all format codes to BN. For each sequence of BN codes, set the levels of the codes to the level of the preceding non-BN code (or if the sequence is at the beginning of the paragraph, to the - graph level). • In rule W1, search backward from each NSM to the first character in the level run whose type is not BN, and set the NSM to its type. • In rule W4, scan past BN types that are adjacent to ES or CS. • In rule W5, change all appropriate sequences of ET and BN, not just ET. • In rule W6, change all BN types to ON as well. • In rule L1, include format codes and BN together with whitespace characters in the sequences whose levels are reset before a separator or line break. Implementations that display visible representations of format characters will want to adjust this mechanism so as to position the format characters optimally for editing. Vertical Text. In the case of vertical line orientation, the bidirectional algorithm is still used to determine the levels of the text. However, these levels are not used to reorder the text, as the characters are usually ordered uniformly from top to bottom. Instead, the levels are used to determine the rotation of the text. Sometimes vertical lines follow a vertical base- line in which each character is oriented as normal (with no rotation), with characters ordered from top to bottom whether they are Hebrew, numbers, or Latin. When setting text using the Arabic script in vertical lines, it is more common to employ a horizontal baseline that is rotated by 90 degrees counterclockwise so that the characters are ordered from top to bottom. Latin text and numbers may be rotated 90 degrees clockwise so that the characters become ordered from top to bottom as well. The bidirectional algorithm also comes into play when some characters are ordered from bottom to top. For example, this situation arises with a mixture of Arabic and Latin glyphs when all of the glyphs are rotated uniformly 90 degrees clockwise. (The choice of whether text is to be presented horizontally or vertically, or whether text is to be rotated, is not spec- ified by the Unicode Standard, and is left to higher-level protocols.) Usage. Because of the implicit character types and the heuristics for resolving neutral and numeric directional behavior, the implicit bidirectional ordering will generally produce the correct display without any further work. However, problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or nested segments of different-direction text are present, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the correct display. Part numbers may also require directional overrides. The most common problematic case involves neutrals on the boundary of an embedded language. This issue can be addressed by setting the level of the embedded text correctly. For example, with all text at level 0, the following occurs:

68 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior

Memory: he said "I NEED WATER!", and expired. Display: he said "RETAW DEEN I!", and expired. If the is to be part of the Arabic quotation, then the user can select the text I NEED WATER! and explicitly mark it as embedded Arabic, which produces the fol- lowing result: Memory: he said "I NEED WATER!", and expired. Display: he said "!RETAW DEEN I", and expired. A simpler method is to place a right directional mark (RLM) after the exclamation mark. Because the exclamation mark is no longer on a directional boundary, this produces the correct result. Memory: he said "I NEED WATER!", and expired. Display: he said "!RETAW DEEN I", and expired. This latter approach is preferred because it does not use the stateful format codes, which can easily get out of sync if not fully supported by editors and other string manipulation. The stateful format codes are generally needed only for more complex (and rare) cases such as double embeddings, as in the following: Memory: DID YOU SAY ‘he said "I NEED WATER!", and expired.’? Display: ?‘he said "!RETAW DEEN I", and expired.’ YAS UOY DID Migrating from 2.0 to 3.0. In the Unicode Character Database for Version 3.0, new bidirec- tional character types are introduced to make the body of the algorithm depend only on the types of characters, and not on the character values. The changes from the Version 2.0 bidi- rectional types are listed in Table 3-11. Table 3-11. New Bidirectional Types

Characters New Bidirectional Type All characters with General Category Me, Mn NSM All characters of type R in the Arabic ranges (0600..06FF, AL FB50..FDFF, FE70..FEFE) (Letters in the Thaana and Syriac ranges also have this value.) The explicit embedding characters: LRO, RLO, LRE, RLE, PDF LRO, RLO, LRE, RLE, PDF, respectively Formatting characters and controls (General Category Cf and BN Cc) that were of bidirectional type ON Zero Width Space BN

Implementations that use older property tables can be adjusted to the modifications in the bidirectional algorithm by algorithmically remapping the characters listed in Table 3-11 to the new types.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 69 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 4 Character Properties 4

Disclaimer The content of all character property tables has been verified as far as possible by the Unicode Consortium. However, the Unicode Consortium does not guarantee that the tables printed in this volume or on the CD-ROM are correct in every detail, and it is not responsible for errors that may occur either in the character property tables or in software that implements these tables. The contents of all the tables in this chapter may be superseded or augmented by information on the Uni- code Web site.

This chapter describes the attributes of character properties defined by the Unicode Stan- dard and gives mappings of characters to specific character properties. Full listings for all Unicode properties are provided in the Unicode Character Database on the CD-ROM. While the Unicode Consortium strives to minimize changes to character property data, occasionally character properties must be updated. When this situation occurs, the relevant data files of the Unicode Character Database are revised. The revised data files are posted on the Unicode Web site as an update version of the standard. Normative and Informative Properties As specified in Chapter 3, Conformance, the Unicode Standard defines both normative and informative properties. Normative Properties. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property must follow the specifications of the standard for that property to be conformant. The term normative when applied to a character property does not mean that the value of the prop- erty will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes. Informative Properties. If a character property is only informative, a conformant imple- mentation is free to use or change such values as it may require, while still remaining con- formant to the standard. Particular implementations may choose to override the properties that are not normative. In that case, the implementer has the option of establishing a pro- tocol to convey that information. The normative character properties of the Unicode Standard are given in Table 4-1, and the informative character properties are given in Table 4-2. Also included in the tables are the locations of each property’s description and of the list of characters with that property.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 73 Character Properties

Table 4-1. Normative Character Properties

Property Description List Case Chapter 4 Chapter 14a Combining Classes Chapter 3 Ta ble 4 -3 Conjoining Jamo (1100–11FF) Chapter 3 Chapter 14 Decomposition (Canonical and Compatibility) Chapter 3 Chapter 14 Directionality Section 3.12 CD-ROM Jamo Short Name Chapter 3 Ta ble 4 -4 Numeric Value Chapter 4 Ta ble 4 -6 Private Use Chapter 3 Section 13.5 Special Character Properties Chapter 13 Section 3.9 Surrogate Chapter 3 Section 13.4 Mirrored Chapter 3 Ta ble 4 -9 Unicode Character Names Chapter 14 Chapter 14 a. This property is listed by individual character where it is not obvious. Table 4-2. Informative Character Properties

Property Description List Case Mapping Chapter 4 CD-ROM Chapter 6 Ta ble 6 -2 East Asian Width Section 10.3 CD-ROM UTR #11 Letters (Alphabetic and Ideographic) Chapter 4 CD-ROM Line Breaking Chapter 13 Section 13.1 and Section 13.2 Mathematical Property Chapter 4 Section 4.9 Spaces Chapter 6 Ta ble 6 -1 Unicode 1.0 Names Chapter 4 CD-ROM

Default Properties. Implementations need specific properties for all code points, including those that are unassigned. To meet this need, the Unicode standard assigns default proper- ties to ranges of unassigned code points. For example, see Table 3-8. Consistency of Properties. The Unicode Standard is the product of many compromises. It has to strike a balance between uniformity of treatment for similar characters and compat- ibility with existing practice for characters inherited from legacy encodings. Because of this balancing act, one can expect a certain number of anomalies in character properties. For example, some pairs of characters might have been treated as canonical equivalents but are left unequivalent for compatibility with legacy differences. This situation pertains to U+00B5 š   (cf. U+03BC ¡    ) as well as to certain Korean jamo. In addition, some characters might have had properties differing in some ways from those assigned in this standard, but whose properties are left as is for compatibility with existing practice. This situation can be seen with the halfwidth voicing marks for Japanese (U+FF9E      and U+FF9F  -  -  ), which might have been better analyzed as spacing com- bining marks, and with the conjoining Hangul jamo, which might have been better analyzed as an initial base character, followed by formally combining medial and final char- acters. In the interest of efficiency and uniformity in algorithms, implementations may take advantage of such reanalyses of character properties, as long as the results they produce do not overtly conflict with those specified by the normative properties of this standard.

74 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.1 Case—Normative

4.1 Case—Normative Case is a normative property of characters in certain alphabets whereby characters are con- sidered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lower- case letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Because of the inclusion of certain composite characters for compatibility, such as U+01F1    , a third case, called titlecase, is used where the first character of a word must be capitalized. An example of such a character is U+01F2        . The three case forms are UPPERCASE, Titlecase, lowercase. For those scripts that have case (Latin, Greek, Cyrillic, Armenian, and archaic Georgian), the case of a Unicode character can usually be obtained from the character’s name. This statement is true for only these five scripts. Uppercase characters typically contain the word capital in their names. Lowercase characters typically contain the word small. The word small in the names of characters from scripts other than the five just listed has nothing to do with case. (Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian. See Section 7.5, Georgian.) Case Mappings. The lowercase letter default case mapping occurs between the small char- acter and the capital character. The Unicode Standard case mapping tables, which are informative, are on the CD-ROM. Exceptions to the normal casing rules can be found in the data file SpecialCasing.txt. For more information on case mappings, see Section 5.18, Case Mappings, and Unicode Technical Report #21, “Case Mappings,” on the CD-ROM or the up-to-date version on the Unicode Web site.

4.2 Combining Classes—Normative Each combining character has a normative canonical combining class. This class is used with the canonical ordering algorithm to determine which combining characters interact typographically and to determine how the canonical ordering of sequences of combining characters takes place. Class zero combining characters act like base letters for the purpose of determining canonical order. Combining characters with non-zero classes participate in reordering for the purpose of determining the canonical form of sequences of characters. (See Section 3.10, Canonical Ordering Behavior, for a description of the algorithm.) The list of combining characters and their canonical combining class appears in Table 4-3. Most combining characters are nonspacing. The spacing, class zero, combining characters are so noted. Table 4-3. Combining Classes Spacing and Nonspacing (Class 0) U+07A6 0 THAANA ABAFILI U+07A7 0 THAANA AABAAFILI U+07A8 0 THAANA IBIFILI U+07A9 0 THAANA EEBEEFILI U+07AA 0 THAANA UBUFILI U+07AB 0 THAANA OOBOOFILI U+07AC 0 THAANA EBEFILI U+07AD 0 THAANA EYBEYFILI U+07AE 0 THAANA OBOFILI U+07AF 0 THAANA OABOAFILI

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 75 4.2 Combining Classes—Normative Character Properties

Table 4-3. Combining Classes (Continued) U+07B0 0 THAANA SUKUN U+0901 0 DEVANAGARI SIGN CANDRABINDU U+0902 0 DEVANAGARI SIGN U+0903 0 DEVANAGARI SIGN U+093E 0 DEVANAGARI VOWEL SIGN AA U+0940 0 DEVANAGARI VOWEL SIGN II U+0941 0 DEVANAGARI VOWEL SIGN U U+0942 0 DEVANAGARI VOWEL SIGN UU U+0943 0 DEVANAGARI VOWEL SIGN VOCALIC R U+0944 0 DEVANAGARI VOWEL SIGN VOCALIC RR U+0945 0 DEVANAGARI VOWEL SIGN CANDRA E U+0946 0 DEVANAGARI VOWEL SIGN SHORT E U+0947 0 DEVANAGARI VOWEL SIGN E U+0948 0 DEVANAGARI VOWEL SIGN U+0949 0 DEVANAGARI VOWEL SIGN CANDRA O U+094A 0 DEVANAGARI VOWEL SIGN SHORT O U+094B 0 DEVANAGARI VOWEL SIGN O U+094C 0 DEVANAGARI VOWEL SIGN U+0962 0 DEVANAGARI VOWEL SIGN VOCALIC L U+0963 0 DEVANAGARI VOWEL SIGN VOCALIC LL U+0981 0 BENGALI SIGN CANDRABINDU U+0982 0 BENGALI SIGN ANUSVARA U+0983 0 BENGALI SIGN VISARGA U+09BE 0 BENGALI VOWEL SIGN AA U+09C0 0 BENGALI VOWEL SIGN II U+09C1 0 BENGALI VOWEL SIGN U U+09C2 0 BENGALI VOWEL SIGN UU U+09C3 0 BENGALI VOWEL SIGN VOCALIC R U+09C4 0 BENGALI VOWEL SIGN VOCALIC RR U+09D7 0 BENGALI AU LENGTH MARK U+09E2 0 BENGALI VOWEL SIGN VOCALIC L U+09E3 0 BENGALI VOWEL SIGN VOCALIC LL U+0A02 0 GURMUKHI SIGN BINDI U+0A3E 0 GURMUKHI VOWEL SIGN AA U+0A40 0 GURMUKHI VOWEL SIGN II U+0A41 0 GURMUKHI VOWEL SIGN U U+0A42 0 GURMUKHI VOWEL SIGN UU U+0A47 0 GURMUKHI VOWEL SIGN EE U+0A48 0 GURMUKHI VOWEL SIGN AI U+0A4B 0 GURMUKHI VOWEL SIGN OO U+0A4C 0 GURMUKHI VOWEL SIGN AU U+0A70 0 GURMUKHI TIPPI U+0A71 0 GURMUKHI ADDAK U+0A81 0 GUJARATI SIGN CANDRABINDU U+0A82 0 GUJARATI SIGN ANUSVARA U+0A83 0 GUJARATI SIGN VISARGA U+0ABE 0 GUJARATI VOWEL SIGN AA U+0AC0 0 GUJARATI VOWEL SIGN II U+0AC1 0 GUJARATI VOWEL SIGN U U+0AC2 0 GUJARATI VOWEL SIGN UU U+0AC3 0 GUJARATI VOWEL SIGN VOCALIC R U+0AC4 0 GUJARATI VOWEL SIGN VOCALIC RR U+0AC5 0 GUJARATI VOWEL SIGN CANDRA E U+0AC7 0 GUJARATI VOWEL SIGN E U+0AC8 0 GUJARATI VOWEL SIGN AI U+0AC9 0 GUJARATI VOWEL SIGN CANDRA O U+0ACB 0 GUJARATI VOWEL SIGN O U+0ACC 0 GUJARATI VOWEL SIGN AU U+0B01 0 ORIYA SIGN CANDRABINDU U+0B02 0 ORIYA SIGN ANUSVARA U+0B03 0 ORIYA SIGN VISARGA U+0B3E 0 ORIYA VOWEL SIGN AA

76 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative

Table 4-3. Combining Classes (Continued) U+0B3F 0 ORIYA VOWEL SIGN I U+0B40 0 ORIYA VOWEL SIGN II U+0B41 0 ORIYA VOWEL SIGN U U+0B42 0 ORIYA VOWEL SIGN UU U+0B43 0 ORIYA VOWEL SIGN VOCALIC R U+0B56 0 ORIYA AI LENGTH MARK U+0B57 0 ORIYA AU LENGTH MARK U+0B82 0 TAMIL SIGN ANUSVARA U+0B83 0 TAMIL SIGN VISARGA U+0BBE 0 TAMIL VOWEL SIGN AA U+0BBF 0 TAMIL VOWEL SIGN I U+0BC0 0 TAMIL VOWEL SIGN II U+0BC1 0 TAMIL VOWEL SIGN U U+0BC2 0 TAMIL VOWEL SIGN UU U+0BD7 0 TAMIL AU LENGTH MARK U+0C01 0 TELUGU SIGN CANDRABINDU U+0C02 0 TELUGU SIGN ANUSVARA U+0C03 0 TELUGU SIGN VISARGA U+0C3E 0 TELUGU VOWEL SIGN AA U+0C3F 0 TELUGU VOWEL SIGN I U+0C40 0 TELUGU VOWEL SIGN II U+0C41 0 TELUGU VOWEL SIGN U U+0C42 0 TELUGU VOWEL SIGN UU U+0C43 0 TELUGU VOWEL SIGN VOCALIC R U+0C44 0 TELUGU VOWEL SIGN VOCALIC RR U+0C46 0 TELUGU VOWEL SIGN E U+0C47 0 TELUGU VOWEL SIGN EE U+0C48 0 TELUGU VOWEL SIGN AI U+0C4A 0 TELUGU VOWEL SIGN O U+0C4B 0 TELUGU VOWEL SIGN OO U+0C4C 0 TELUGU VOWEL SIGN AU U+0C82 0 KANNADA SIGN ANUSVARA U+0C83 0 KANNADA SIGN VISARGA U+0CBE 0 KANNADA VOWEL SIGN AA U+0CBF 0 KANNADA VOWEL SIGN I U+0CC1 0 KANNADA VOWEL SIGN U U+0CC2 0 KANNADA VOWEL SIGN UU U+0CC3 0 KANNADA VOWEL SIGN VOCALIC R U+0CC4 0 KANNADA VOWEL SIGN VOCALIC RR U+0CC6 0 KANNADA VOWEL SIGN E U+0CCC 0 KANNADA VOWEL SIGN AU U+0CD5 0 KANNADA LENGTH MARK U+0CD6 0 KANNADA AI LENGTH MARK U+0D02 0 MALAYALAM SIGN ANUSVARA U+0D03 0 MALAYALAM SIGN VISARGA U+0D3E 0 MALAYALAM VOWEL SIGN AA U+0D3F 0 MALAYALAM VOWEL SIGN I U+0D40 0 MALAYALAM VOWEL SIGN II U+0D41 0 MALAYALAM VOWEL SIGN U U+0D42 0 MALAYALAM VOWEL SIGN UU U+0D43 0 MALAYALAM VOWEL SIGN VOCALIC R U+0D57 0 MALAYALAM AU LENGTH MARK U+0D82 0 SINHALA SIGN ANUSVARAYA U+0D83 0 SINHALA SIGN VISARGAYA U+0DCF 0 SINHALA VOWEL SIGN AELA-PILLA U+0DD0 0 SINHALA VOWEL SIGN KETTI AEDA-PILLA U+0DD1 0 SINHALA VOWEL SIGN DIGA AEDA-PILLA U+0DD2 0 SINHALA VOWEL SIGN KETTI IS-PILLA U+0DD3 0 SINHALA VOWEL SIGN DIGA IS-PILLA U+0DD4 0 SINHALA VOWEL SIGN KETTI PAA-PILLA U+0DD6 0 SINHALA VOWEL SIGN DIGA PAA-PILLA U+0DD8 0 SINHALA VOWEL SIGN GAETTA-PILLA

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 77 4.2 Combining Classes—Normative Character Properties

Table 4-3. Combining Classes (Continued) U+0DDF 0 SINHALA VOWEL SIGN GAYANUKITTA U+0DF2 0 SINHALA VOWEL SIGN DIGA GAETTA-PILLA U+0DF3 0 SINHALA VOWEL SIGN DIGA GAYANUKITTA U+0E31 0 THAI CHARACTER MAI HAN-AKAT U+0E34 0 THAI CHARACTER SARA I U+0E35 0 THAI CHARACTER SARA II U+0E36 0 THAI CHARACTER SARA U+0E37 0 THAI CHARACTER SARA UEE U+0E47 0 THAI CHARACTER MAITAIKHU U+0E4C 0 THAI CHARACTER THANTHAKHAT U+0E4D 0 THAI CHARACTER NIKHAHIT U+0E4E 0 THAI CHARACTER YAMAKKAN U+0EB1 0 LAO VOWEL SIGN MAI KAN U+0EB4 0 LAO VOWEL SIGN I U+0EB5 0 LAO VOWEL SIGN II U+0EB6 0 LAO VOWEL SIGN Y U+0EB7 0 LAO VOWEL SIGN YY U+0EBB 0 LAO VOWEL SIGN MAI KON U+0EBC 0 LAO SIGN LO U+0ECC 0 LAO CANCELLATION MARK U+0ECD 0 LAO NIGGAHITA U+0F3E 0 TIBETAN SIGN YAR TSHES U+0F3F 0 TIBETAN SIGN MAR TSHES U+0F73 0 TIBETAN VOWEL SIGN II U+0F75 0 TIBETAN VOWEL SIGN UU U+0F76 0 TIBETAN VOWEL SIGN VOCALIC R U+0F77 0 TIBETAN VOWEL SIGN VOCALIC RR U+0F78 0 TIBETAN VOWEL SIGN VOCALIC L U+0F79 0 TIBETAN VOWEL SIGN VOCALIC LL U+0F7E 0 TIBETAN SIGN RJES U+0F7F 0 TIBETAN SIGN RNAM BCAD U+0F81 0 TIBETAN VOWEL SIGN REVERSED II U+102C 0 MYANMAR VOWEL SIGN AA U+102D 0 MYANMAR VOWEL SIGN I U+102E 0 MYANMAR VOWEL SIGN II U+102F 0 MYANMAR VOWEL SIGN U U+1030 0 MYANMAR VOWEL SIGN UU U+1032 0 MYANMAR VOWEL SIGN AI U+1036 0 MYANMAR SIGN ANUSVARA U+1038 0 MYANMAR SIGN VISARGA U+1056 0 MYANMAR VOWEL SIGN VOCALIC R U+1057 0 MYANMAR VOWEL SIGN VOCALIC RR U+1058 0 MYANMAR VOWEL SIGN VOCALIC L U+1059 0 MYANMAR VOWEL SIGN VOCALIC LL U+17B4 0 KHMER VOWEL INHERENT AQ U+17B5 0 KHMER VOWEL INHERENT AA U+17B6 0 KHMER VOWEL SIGN AA U+17B7 0 KHMER VOWEL SIGN I U+17B8 0 KHMER VOWEL SIGN II U+17B9 0 KHMER VOWEL SIGN Y U+17BA 0 KHMER VOWEL SIGN YY U+17BB 0 KHMER VOWEL SIGN U U+17BC 0 KHMER VOWEL SIGN UU U+17BD 0 KHMER VOWEL SIGN UA U+17C6 0 KHMER SIGN NIKAHIT U+17C7 0 KHMER SIGN REAHMUK U+17C8 0 KHMER SIGN YUUKALEAPINTU U+17C9 0 KHMER SIGN MUUSIKATOAN U+17CA 0 KHMER SIGN TRIISAP U+17CB 0 KHMER SIGN BANTOC U+17CC 0 KHMER SIGN ROBAT U+17CD 0 KHMER SIGN TOANDAKHIAT

78 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative

Table 4-3. Combining Classes (Continued) U+17CE 0 KHMER SIGN KAKABAT U+17CF 0 KHMER SIGN AHSDA U+17D0 0 KHMER SIGN SAMYOK SANNYA U+17D1 0 KHMER SIGN VIRIAM U+17D3 0 KHMER SIGN BATHAMASAT Split (Class 0) U+09CB 0 BENGALI VOWEL SIGN O U+09CC 0 BENGALI VOWEL SIGN AU U+0B48 0 ORIYA VOWEL SIGN AI U+0B4B 0 ORIYA VOWEL SIGN O U+0B4C 0 ORIYA VOWEL SIGN AU U+0BCA 0 TAMIL VOWEL SIGN O U+0BCB 0 TAMIL VOWEL SIGN OO U+0BCC 0 TAMIL VOWEL SIGN AU U+0CC0 0 KANNADA VOWEL SIGN II U+0CC7 0 KANNADA VOWEL SIGN EE U+0CC8 0 KANNADA VOWEL SIGN AI U+0CCA 0 KANNADA VOWEL SIGN O U+0CCB 0 KANNADA VOWEL SIGN OO U+0D4A 0 MALAYALAM VOWEL SIGN O U+0D4B 0 MALAYALAM VOWEL SIGN OO U+0D4C 0 MALAYALAM VOWEL SIGN AU U+0DDC 0 SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA U+0DDD 0 SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA U+0DDE 0 SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA U+17BF 0 KHMER VOWEL SIGN U+17C0 0 KHMER VOWEL SIGN U+17C4 0 KHMER VOWEL SIGN OO U+17C5 0 KHMER VOWEL SIGN AU Reordrant (Class 0) U+093F 0 DEVANAGARI VOWEL SIGN I U+09BF 0 BENGALI VOWEL SIGN I U+09C7 0 BENGALI VOWEL SIGN E U+09C8 0 BENGALI VOWEL SIGN AI U+0A3F 0 GURMUKHI VOWEL SIGN I U+0ABF 0 GUJARATI VOWEL SIGN I U+0B47 0 ORIYA VOWEL SIGN E U+0BC6 0 TAMIL VOWEL SIGN E U+0BC7 0 TAMIL VOWEL SIGN EE U+0BC8 0 TAMIL VOWEL SIGN AI U+0D46 0 MALAYALAM VOWEL SIGN E U+0D47 0 MALAYALAM VOWEL SIGN EE U+0D48 0 MALAYALAM VOWEL SIGN AI U+0DD9 0 SINHALA VOWEL SIGN KOMBUVA U+0DDA 0 SINHALA VOWEL SIGN DIGA KOMBUVA U+0DDB 0 SINHALA VOWEL SIGN KOMBU DEKA U+1031 0 MYANMAR VOWEL SIGN E U+17BE 0 KHMER VOWEL SIGN OE U+17C1 0 KHMER VOWEL SIGN E U+17C2 0 KHMER VOWEL SIGN U+17C3 0 KHMER VOWEL SIGN AI Tibetan Subjoined (Class 0) Letters U+0F90 0 TIBETAN SUBJOINED LETTER U+0F91 0 TIBETAN SUBJOINED LETTER U+0F92 0 TIBETAN SUBJOINED LETTER GA U+0F93 0 TIBETAN SUBJOINED LETTER U+0F94 0 TIBETAN SUBJOINED LETTER NGA U+0F95 0 TIBETAN SUBJOINED LETTER CA U+0F96 0 TIBETAN SUBJOINED LETTER U+0F97 0 TIBETAN SUBJOINED LETTER JA

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 79 4.2 Combining Classes—Normative Character Properties

Table 4-3. Combining Classes (Continued) U+0F99 0 TIBETAN SUBJOINED LETTER U+0F9A 0 TIBETAN SUBJOINED LETTER TTA U+0F9B 0 TIBETAN SUBJOINED LETTER TTHA U+0F9C 0 TIBETAN SUBJOINED LETTER DDA U+0F9D 0 TIBETAN SUBJOINED LETTER DDHA U+0F9E 0 TIBETAN SUBJOINED LETTER NNA U+0F9F 0 TIBETAN SUBJOINED LETTER TA U+0FA0 0 TIBETAN SUBJOINED LETTER U+0FA1 0 TIBETAN SUBJOINED LETTER DA U+0FA2 0 TIBETAN SUBJOINED LETTER U+0FA3 0 TIBETAN SUBJOINED LETTER U+0FA4 0 TIBETAN SUBJOINED LETTER U+0FA5 0 TIBETAN SUBJOINED LETTER U+0FA6 0 TIBETAN SUBJOINED LETTER U+0FA7 0 TIBETAN SUBJOINED LETTER U+0FA8 0 TIBETAN SUBJOINED LETTER U+0FA9 0 TIBETAN SUBJOINED LETTER TSA U+0FAA 0 TIBETAN SUBJOINED LETTER TSHA U+0FAB 0 TIBETAN SUBJOINED LETTER DZA U+0FAC 0 TIBETAN SUBJOINED LETTER DZHA U+0FAD 0 TIBETAN SUBJOINED LETTER WA U+0FAE 0 TIBETAN SUBJOINED LETTER ZHA U+0FAF 0 TIBETAN SUBJOINED LETTER ZA U+0FB0 0 TIBETAN SUBJOINED LETTER -A U+0FB1 0 TIBETAN SUBJOINED LETTER YA U+0FB2 0 TIBETAN SUBJOINED LETTER U+0FB3 0 TIBETAN SUBJOINED LETTER U+0FB4 0 TIBETAN SUBJOINED LETTER U+0FB5 0 TIBETAN SUBJOINED LETTER SSA U+0FB6 0 TIBETAN SUBJOINED LETTER U+0FB7 0 TIBETAN SUBJOINED LETTER U+0FB8 0 TIBETAN SUBJOINED LETTER A U+0FB9 0 TIBETAN SUBJOINED LETTER KSSA U+0FBA 0 TIBETAN SUBJOINED LETTER FIXED-FORM WA U+0FBB 0 TIBETAN SUBJOINED LETTER FIXED-FORM YA U+0FBC 0 TIBETAN SUBJOINED LETTER FIXED-FORM RA Enclosing (Class 0) U+0488 0 CYRILLIC HUNDRED THOUSANDS SIGN U+0489 0 CYRILLIC MILLIONS SIGN U+06DD 0 ARABIC END OF AYAH U+06DE 0 ARABIC START OF RUB EL HIZB U+20DD 0 COMBINING ENCLOSING CIRCLE U+20DE 0 COMBINING ENCLOSING SQUARE U+20DF 0 COMBINING ENCLOSING DIAMOND U+20E0 0 COMBINING ENCLOSING CIRCLE U+20E2 0 COMBINING ENCLOSING SCREEN U+20E3 0 COMBINING ENCLOSING KEYCAP Overlays/Interior (Class 1) U+0334 1 COMBINING TILDE OVERLAY U+0335 1 COMBINING SHORT STROKE OVERLAY U+0336 1 COMBINING LONG STROKE OVERLAY U+0337 1 COMBINING SHORT SOLIDUS OVERLAY U+0338 1 COMBINING LONG SOLIDUS OVERLAY U+20D2 1 COMBINING LONG VERTICAL LINE OVERLAY U+20D3 1 COMBINING SHORT VERTICAL LINE OVERLAY U+20D8 1 COMBINING RING OVERLAY U+20D9 1 COMBINING CLOCKWISE RING OVERLAY U+20DA 1 COMBINING ANTICLOCKWISE RING OVERLAY Nuktas (Class 7) U+093C 7 DEVANAGARI SIGN NUKTA U+09BC 7 BENGALI SIGN NUKTA

80 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative

Table 4-3. Combining Classes (Continued) U+0A3C 7 GURMUKHI SIGN NUKTA U+0ABC 7 GUJARATI SIGN NUKTA U+0B3C 7 ORIYA SIGN NUKTA U+1037 7 MYANMAR SIGN DOT BELOW Kana Voicing Marks (Class 8) U+3099 8 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+309A 8 COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK (Class 9) U+094D 9 DEVANAGARI SIGN VIRAMA U+09CD 9 BENGALI SIGN VIRAMA U+0A4D 9 GURMUKHI SIGN VIRAMA U+0ACD 9 GUJARATI SIGN VIRAMA U+0B4D 9 ORIYA SIGN VIRAMA U+0BCD 9 TAMIL SIGN VIRAMA U+0C4D 9 TELUGU SIGN VIRAMA U+0CCD 9 KANNADA SIGN VIRAMA U+0D4D 9 MALAYALAM SIGN VIRAMA U+0DCA 9 SINHALA SIGN AL-LAKUNA U+0E3A 9 THAI CHARACTER PHINTHU U+0F84 9 TIBETAN MARK HALANTA U+1039 9 MYANMAR SIGN VIRAMA U+17D2 9 KHMER SIGN COENG Fixed Position Classes (Classes 10..199) U+05B0 10 HEBREW POINT SHEVA U+05B1 11 HEBREW POINT HATAF U+05B2 12 HEBREW POINT HATAF PATAH U+05B3 13 HEBREW POINT HATAF QAMATS U+05B4 14 HEBREW POINT U+05B5 15 HEBREW POINT TSERE U+05B6 16 HEBREW POINT SEGOL U+05B7 17 HEBREW POINT PATAH U+05B8 18 HEBREW POINT QAMATS U+05B9 19 HEBREW POINT U+05BB 20 HEBREW POINT QUBUTS U+05BC 21 HEBREW POINT OR MAPIQ U+05BD 22 HEBREW POINT U+05BF 23 HEBREW POINT U+05C1 24 HEBREW POINT DOT U+05C2 25 HEBREW POINT SIN DOT U+FB1E 26 HEBREW POINT JUDEO-SPANISH VARIKA U+064B 27 ARABIC FATHATAN U+064C 28 ARABIC DAMMATAN U+064D 29 ARABIC KASRATAN U+064E 30 ARABIC FATHA U+064F 31 ARABIC DAMMA U+0650 32 ARABIC KASRA U+0651 33 ARABIC SHADDA U+0652 34 ARABIC SUKUN U+0670 35 ARABIC LETTER SUPERSCRIPT ALEF U+0711 36 SYRIAC LETTER SUPERSCRIPT ALAPH U+0C55 84 TELUGU LENGTH MARK U+0C56 91 TELUGU AI LENGTH MARK U+0E38 103 THAI CHARACTER SARA U U+0E39 103 THAI CHARACTER SARA UU U+0E48 107 THAI CHARACTER MAI EK U+0E49 107 THAI CHARACTER MAI THO U+0E4A 107 THAI CHARACTER MAI TRI U+0E4B 107 THAI CHARACTER MAI CHATTAWA U+0EB8 118 LAO VOWEL SIGN U U+0EB9 118 LAO VOWEL SIGN UU

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 81 4.2 Combining Classes—Normative Character Properties

Table 4-3. Combining Classes (Continued) U+0EC8 122 LAO TONE MAI EK U+0EC9 122 LAO TONE MAI THO U+0ECA 122 LAO TONE MAI TI U+0ECB 122 LAO TONE MAI CATAWA U+0F71 129 TIBETAN VOWEL SIGN AA U+0F72 130 TIBETAN VOWEL SIGN I U+0F7A 130 TIBETAN VOWEL SIGN E U+0F7B 130 TIBETAN VOWEL SIGN EE U+0F7C 130 TIBETAN VOWEL SIGN O U+0F7D 130 TIBETAN VOWEL SIGN OO U+0F80 130 TIBETAN VOWEL SIGN REVERSED I U+0F74 132 TIBETAN VOWEL SIGN U Below Left Attached (Class 200) Below Attached (Class 202) U+0321 202 COMBINING PALATALIZED BELOW U+0322 202 COMBINING RETROFLEX HOOK BELOW U+0327 202 COMBINING U+0328 202 COMBINING Below Right Attached (Class 204) Left Attached Reordrant (Class 208) Right Attached (Class 210) Above Left Attached (Class 212) Above Attached (Class 214) Above Right Attached (Class 216) U+031B 216 COMBINING U+0F39 216 TIBETAN MARK TSA -PHRU Below Left (Class 218) U+302A 218 IDEOGRAPHIC LEVEL TONE MARK Below (Class 220) U+0316 220 COMBINING GRAVE ACCENT BELOW U+0317 220 COMBINING ACUTE ACCENT BELOW U+0318 220 COMBINING LEFT TACK BELOW U+0319 220 COMBINING RIGHT TACK BELOW U+031C 220 COMBINING LEFT HALF RING BELOW U+031D 220 COMBINING BELOW U+031E 220 COMBINING DOWN TACK BELOW U+031F 220 COMBINING PLUS SIGN BELOW U+0320 220 COMBINING MINUS SIGN BELOW U+0323 220 COMBINING DOT BELOW U+0324 220 COMBINING DIAERESIS BELOW U+0325 220 COMBINING RING BELOW U+0326 220 COMBINING COMMA BELOW U+0329 220 COMBINING VERTICAL LINE BELOW U+032A 220 COMBINING BRIDGE BELOW U+032B 220 COMBINING INVERTED DOUBLE ARCH BELOW U+032C 220 COMBINING BELOW U+032D 220 COMBINING CIRCUMFLEX ACCENT BELOW U+032E 220 COMBINING BREVE BELOW U+032F 220 COMBINING BELOW U+0330 220 COMBINING TILDE BELOW U+0331 220 COMBINING BELOW U+0332 220 COMBINING LOW LINE U+0333 220 COMBINING DOUBLE LOW LINE U+0339 220 COMBINING RIGHT HALF RING BELOW U+033A 220 COMBINING INVERTED BRIDGE BELOW U+033B 220 COMBINING SQUARE BELOW U+033C 220 COMBINING SEAGULL BELOW

82 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative

Table 4-3. Combining Classes (Continued) U+0347 220 COMBINING BELOW U+0348 220 COMBINING DOUBLE VERTICAL LINE BELOW U+0349 220 COMBINING LEFT ANGLE BELOW U+034D 220 COMBINING LEFT RIGHT ARROW BELOW U+034E 220 COMBINING UPWARDS ARROW BELOW U+0591 220 HEBREW ACCENT ETNAHTA U+0596 220 HEBREW ACCENT TIPEHA U+059B 220 HEBREW ACCENT TEVIR U+05A3 220 HEBREW ACCENT MUNAH U+05A4 220 HEBREW ACCENT MAHAPAKH U+05A5 220 HEBREW ACCENT MERKHA U+05A6 220 HEBREW ACCENT MERKHA KEFULA U+05A7 220 HEBREW ACCENT DARGA U+05AA 220 HEBREW ACCENT YERAH BEN YOMO U+0655 220 ARABIC BELOW U+06E3 220 ARABIC SMALL LOW SEEN U+06EA 220 ARABIC EMPTY CENTRE LOW STOP U+06ED 220 ARABIC SMALL LOW MEEM U+0731 220 SYRIAC PTHAHA BELOW U+0734 220 SYRIAC ZQAPHA BELOW U+0737 220 SYRIAC RBASA BELOW U+0738 220 SYRIAC DOTTED ZLAMA HORIZONTAL U+0739 220 SYRIAC DOTTED ZLAMA ANGULAR U+073B 220 SYRIAC HBASA BELOW U+073C 220 SYRIAC HBASA/ESASA DOTTED U+073E 220 SYRIAC ESASA BELOW U+0742 220 SYRIAC RUKKAKHA U+0744 220 SYRIAC TWO VERTICAL DOTS BELOW U+0746 220 SYRIAC THREE DOTS BELOW U+0748 220 SYRIAC OBLIQUE LINE BELOW U+0952 220 DEVANAGARI STRESS SIGN ANUDATTA U+0F18 220 TIBETAN ASTROLOGICAL SIGN -KHYUD PA U+0F19 220 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS U+0F35 220 TIBETAN MARK NGAS BZUNG NYI ZLA U+0F37 220 TIBETAN MARK NGAS BZUNG SGOR RTAGS U+0FC6 220 TIBETAN SYMBOL PADMA GDAN Below Right (Class 222) U+059A 222 HEBREW ACCENT YETIV U+05AD 222 HEBREW ACCENT DEHI U+302D 222 IDEOGRAPHIC ENTERING TONE MARK Left (Class 224) U+302E 224 HANGUL SINGLE DOT TONE MARK U+302F 224 HANGUL DOUBLE DOT TONE MARK Right (Class 226) Above Left (Class 228) U+05AE 228 HEBREW ACCENT ZINOR U+18A9 228 MONGOLIAN LETTER AG DAGALGA U+302B 228 IDEOGRAPHIC RISING TONE MARK Above (Class 230) U+0300 230 COMBINING GRAVE ACCENT U+0301 230 COMBINING ACUTE ACCENT U+0302 230 COMBINING CIRCUMFLEX ACCENT U+0303 230 COMBINING TILDE U+0304 230 COMBINING MACRON U+0305 230 COMBINING U+0306 230 COMBINING BREVE U+0307 230 COMBINING DOT ABOVE U+0308 230 COMBINING DIAERESIS U+0309 230 COMBINING U+030A 230 COMBINING RING ABOVE

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 83 4.2 Combining Classes—Normative Character Properties

Table 4-3. Combining Classes (Continued) U+030B 230 COMBINING U+030C 230 COMBINING CARON U+030D 230 COMBINING VERTICAL LINE ABOVE U+030E 230 COMBINING DOUBLE VERTICAL LINE ABOVE U+030F 230 COMBINING U+0310 230 COMBINING CANDRABINDU U+0311 230 COMBINING INVERTED BREVE U+0312 230 COMBINING TURNED COMMA ABOVE U+0313 230 COMBINING COMMA ABOVE U+0314 230 COMBINING REVERSED COMMA ABOVE U+033D 230 COMBINING X ABOVE U+033E 230 COMBINING VERTICAL TILDE U+033F 230 COMBINING DOUBLE OVERLINE U+0340 230 COMBINING GRAVE TONE MARK U+0341 230 COMBINING ACUTE TONE MARK U+0342 230 COMBINING GREEK PERISPOMENI U+0343 230 COMBINING GREEK KORONIS U+0344 230 COMBINING GREEK DIALYTIKA TONOS U+0346 230 COMBINING BRIDGE ABOVE U+034A 230 COMBINING NOT TILDE ABOVE U+034B 230 COMBINING HOMOTHETIC ABOVE U+034C 230 COMBINING ALMOST EQUAL TO ABOVE U+0483 230 COMBINING CYRILLIC U+0484 230 COMBINING CYRILLIC PALATALIZATION U+0485 230 COMBINING CYRILLIC DASIA PNEUMATA U+0486 230 COMBINING CYRILLIC PSILI PNEUMATA U+0592 230 HEBREW ACCENT SEGOL U+0593 230 HEBREW ACCENT SHALSHELET U+0594 230 HEBREW ACCENT ZAQEF QATAN U+0595 230 HEBREW ACCENT ZAQEF GADOL U+0597 230 HEBREW ACCENT REVIA U+0598 230 HEBREW ACCENT ZARQA U+0599 230 HEBREW ACCENT PASHTA U+059C 230 HEBREW ACCENT U+059D 230 HEBREW ACCENT GERESH MUQDAM U+059E 230 HEBREW ACCENT U+059F 230 HEBREW ACCENT QARNEY PARA U+05A0 230 HEBREW ACCENT TELISHA GEDOLA U+05A1 230 HEBREW ACCENT PAZER U+05A8 230 HEBREW ACCENT QADMA U+05A9 230 HEBREW ACCENT TELISHA QETANA U+05AB 230 HEBREW ACCENT OLE U+05AC 230 HEBREW ACCENT ILUY U+05AF 230 HEBREW MARK MASORA CIRCLE U+05C4 230 HEBREW MARK UPPER DOT U+0653 230 ARABIC MADDAH ABOVE U+0654 230 ARABIC HAMZA ABOVE U+06D6 230 ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA U+06D7 230 ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA U+06D8 230 ARABIC SMALL HIGH MEEM INITIAL FORM U+06D9 230 ARABIC SMALL HIGH LAM ALEF U+06DA 230 ARABIC SMALL HIGH JEEM U+06DB 230 ARABIC SMALL HIGH THREE DOTS U+06DC 230 ARABIC SMALL HIGH SEEN U+06DF 230 ARABIC SMALL HIGH ROUNDED ZERO U+06E0 230 ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO U+06E1 230 ARABIC SMALL HIGH DOTLESS HEAD OF KHAH U+06E2 230 ARABIC SMALL HIGH MEEM ISOLATED FORM U+06E4 230 ARABIC SMALL HIGH MADDA U+06E7 230 ARABIC SMALL HIGH YEH

84 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.3 Directionality—Normative

Table 4-3. Combining Classes (Continued) U+06E8 230 ARABIC SMALL HIGH NOON U+06EB 230 ARABIC EMPTY CENTRE HIGH STOP U+06EC 230 ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE U+0730 230 SYRIAC PTHAHA ABOVE U+0732 230 SYRIAC PTHAHA DOTTED U+0733 230 SYRIAC ZQAPHA ABOVE U+0735 230 SYRIAC ZQAPHA DOTTED U+0736 230 SYRIAC RBASA ABOVE U+073A 230 SYRIAC HBASA ABOVE U+073D 230 SYRIAC ESASA ABOVE U+073F 230 SYRIAC RWAHA U+0740 230 SYRIAC FEMININE DOT U+0741 230 SYRIAC QUSHSHAYA U+0743 230 SYRIAC TWO VERTICAL DOTS ABOVE U+0745 230 SYRIAC THREE DOTS ABOVE U+0747 230 SYRIAC OBLIQUE LINE ABOVE U+0749 230 SYRIAC MUSIC U+074A 230 SYRIAC BARREKH U+0951 230 DEVANAGARI STRESS SIGN UDATTA U+0953 230 DEVANAGARI GRAVE ACCENT U+0954 230 DEVANAGARI ACUTE ACCENT U+0F82 230 TIBETAN SIGN NYI ZLA NAA DA U+0F83 230 TIBETAN SIGN SNA LDAN U+0F86 230 TIBETAN SIGN LCI RTAGS U+0F87 230 TIBETAN SIGN YANG RTAGS U+20D0 230 COMBINING LEFT HARPOON ABOVE U+20D1 230 COMBINING RIGHT HARPOON ABOVE U+20D4 230 COMBINING ANTICLOCKWISE ARROW ABOVE U+20D5 230 COMBINING CLOCKWISE ARROW ABOVE U+20D6 230 COMBINING LEFT ARROW ABOVE U+20D7 230 COMBINING RIGHT ARROW ABOVE U+20DB 230 COMBINING THREE DOTS ABOVE U+20DC 230 COMBINING FOUR DOTS ABOVE U+20E1 230 COMBINING LEFT RIGHT ARROW ABOVE U+FE20 230 COMBINING LIGATURE LEFT HALF U+FE21 230 COMBINING LIGATURE RIGHT HALF U+FE22 230 COMBINING DOUBLE TILDE LEFT HALF U+FE23 230 COMBINING DOUBLE TILDE RIGHT HALF Above Right (Class 232) U+0315 232 COMBINING COMMA ABOVE RIGHT U+031A 232 COMBINING LEFT ANGLE ABOVE U+302C 232 IDEOGRAPHIC DEPARTING TONE MARK Double Below (Class 233) U+0362 233 COMBINING DOUBLE RIGHTWARDS ARROW BELOW Double Above (Class 234) U+0360 234 COMBINING DOUBLE TILDE U+0361 234 COMBINING DOUBLE INVERTED BREVE Subscript (Class 240) U+0345 240 COMBINING GREEK YPOGEGRAMMENI

4.3 Directionality—Normative Directional behavior is interpreted according to the Unicode bidirectional algorithm (see Section 3.12, Bidirectional Behavior). For this purpose, all characters of the Unicode Stan- dard possess a normative directional type. The directional types left-to-right and right-to- left are called strong types, and characters of these types are called strong directional charac- ters. Left-to-right types include most alphabetic and syllabic characters, as well as all Han

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 85 4.4 Jamo Short Names—Normative Character Properties ideographic characters. Right-to-left types include Arabic, Hebrew, Syriac, and Thaana, and most punctuation specific to those scripts. In addition, the Unicode bidirectional algo- rithm also uses weak types and neutrals. For the directional types of Unicode characters, see the Unicode Character Database on the CD-ROM.

4.4 Jamo Short Names—Normative The jamo short name is a normative property of the Unicode conjoining Hangul jamo char- acters. These short names, which are listed in Table 4-4, are used to determine the character names that are derived when decomposing Hangul syllables into their decomposition sequence. Table 4-4. Jamo Short Names

Short Value Glyph Unicode Name Name U+1100 G þ HANGUL CHOSEONG KIYEOK U+1101 GG € HANGUL CHOSEONG SSANGKIYEOK U+1102 N ó HANGUL CHOSEONG NIEUN U+1103 D ì HANGUL CHOSEONG TIKEUT U+1104 DD ö HANGUL CHOSEONG SSANGTIKEUT U+1105 R ú HANGUL CHOSEONG RIEUL U+1106 M ÷ HANGUL CHOSEONG MIEUM U+1107 B ø HANGUL CHOSEONG PIEUP U+1108 BB í HANGUL CHOSEONG SSANGPIEUP U+1109 S û HANGUL CHOSEONG SIOS U+110A ç HANGUL CHOSEONG SSANGSIOS U+110B ü HANGUL CHOSEONG IEUNG U+110C J å HANGUL CHOSEONG CIEUC U+110D JJ  HANGUL CHOSEONG SSANGCIEUC U+110E C ê HANGUL CHOSEONG CHIEUCH U+110F K ‚ HANGUL CHOSEONG KHIEUKH U+1110 T ƒ HANGUL CHOSEONG THIEUTH U+1111 P ñ HANGUL CHOSEONG PHIEUPH U+1112 H ò HANGUL CHOSEONG HIEUH U+1161 A Æ HANGUL JUNGSEONG A U+1162 AE Ç HANGUL JUNGSEONG AE U+1163 YA È HANGUL JUNGSEONG YA U+1164 É HANGUL JUNGSEONG YAE U+1165 EO Ê HANGUL JUNGSEONG EO U+1166 E Ë HANGUL JUNGSEONG E U+1167 YEO Ì HANGUL JUNGSEONG YEO U+1168 Í HANGUL JUNGSEONG YE U+1169 O Î HANGUL JUNGSEONG O U+116A WA Ï HANGUL JUNGSEONG WA U+116B WAE Ð HANGUL JUNGSEONG WAE U+116C OE Ñ HANGUL JUNGSEONG OE U+116D Ò HANGUL JUNGSEONG YO U+116E U Ó HANGUL JUNGSEONG U U+116F WEO Ô HANGUL JUNGSEONG WEO U+1170 WE Õ HANGUL JUNGSEONG WE

86 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.5 General Category—Normative in Part

Table 4-4. Jamo Short Names (Continued)

Short Value Glyph Unicode Name Name U+1171 Ö HANGUL JUNGSEONG WI U+1172 YU × HANGUL JUNGSEONG YU U+1173 EU Ø HANGUL JUNGSEONG EU U+1174 YI Ù HANGUL JUNGSEONG YI U+1175 I Ú HANGUL JUNGSEONG I U+11A8 G  HANGUL JONGSEONG KIYEOK U+11A9 GG Ž HANGUL JONGSEONG SSANGKIYEOK U+11AA GS  HANGUL JONGSEONG KIYEOK-SIOS U+11AB N  HANGUL JONGSEONG NIEUN U+11AC ‘ HANGUL JONGSEONG NIEUN-CIEUC U+11AD ’ HANGUL JONGSEONG NIEUN-HIEUH U+11AE D “ HANGUL JONGSEONG TIKEUT U+11AF L ” HANGUL JONGSEONG RIEUL U+11B0 LG • HANGUL JONGSEONG RIEUL-KIYEOK U+11B1 LM – HANGUL JONGSEONG RIEUL-MIEUM U+11B2 LB — HANGUL JONGSEONG RIEUL-PIEUP U+11B3 ˜ HANGUL JONGSEONG RIEUL-SIOS U+11B4 LT ™ HANGUL JONGSEONG RIEUL-THIEUTH U+11B5 LP š HANGUL JONGSEONG RIEUL-PHIEUPH U+11B6 LH › HANGUL JONGSEONG RIEUL-HIEUH U+11B7 M œ HANGUL JONGSEONG MIEUM U+11B8 B  HANGUL JONGSEONG PIEUP U+11B9 BS ž HANGUL JONGSEONG PIEUP-SIOS U+11BA S Ÿ HANGUL JONGSEONG SIOS U+11BB SS HANGUL JONGSEONG SSANGSIOS U+11BC NG ¡ HANGUL JONGSEONG IEUNG U+11BD J ¢ HANGUL JONGSEONG CIEUC U+11BE C £ HANGUL JONGSEONG CHIEUCH U+11BF K ¤ HANGUL JONGSEONG KHIEUKH U+11C0 T ¥ HANGUL JONGSEONG THIEUTH U+11C1 P ¦ HANGUL JONGSEONG PHIEUPH U+11C2 H § HANGUL JONGSEONG HIEUH

4.5 General Category—Normative in Part The Unicode Character Database on the CD-ROM defines a General Category for all Uni- code characters. This General Category constitutes a partition of the characters into several major classes, such as letters, punctuation, and symbols, and further subclasses for each of the major classes. Each Unicode character is assigned a General Category value. Each value of the General Category is defined as a two-letter abbreviation, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass “other” merely collects the remaining characters of the major class. For example, the subclass “No” (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 87 4.5 General Category—Normative in Part Character Properties

Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. The General Category is unique, in that some of the values are correlated directly with other normative properties of Unicode characters (numeric status, combining status, or some special properties). Other values are simply informative, provided as an aid to com- mon behavior between Unicode implementations. Table 4-5 enumerates the values of Gen- eral Category, with a short description of each value; it is divided between normative and informative values. Table 4-5. General Category Normative Lu = Letter, uppercase Ll = Letter, lowercase Lt = Letter, titlecase

Mn = Mark, nonspacing Mc = Mark, spacing combining Me = Mark, enclosing

Nd = Number, decimal digit Nl = Number, letter No = Number, other

Zs = Separator, space Zl = Separator, line Zp = Separator, paragraph

Cc = Other, control Cf = Other, format Cs = Other, surrogate Co = Other, private use Cn = Other, not assigned

Informative Lm = Letter, modifier Lo = Letter, other

Pc = Punctuation, connector Pd = Punctuation, dash = Punctuation, open = Punctuation, close = Punctuation, initial quote (may behave like Ps or Pe depending on usage) Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage) Po = Punctuation, other

Sm = Symbol, math Sc = Symbol, currency Sk = Symbol, modifier So = Symbol, other

A common use of the General Category of a Unicode character is to assist in determination of boundaries in text, as in Section 5.15, Locating Text Element Boundaries. Another com- mon use is in determining language identifiers for programming, scripting, and markup, as in Section 5.16, Identifiers. This property is also used to support common APIs such as isLetter(), isUppercase(), and so on.

88 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative

4.6 Numeric Value—Normative Numeric value is a normative property of characters that represent numbers. This group includes characters such as fractions, subscripts, superscripts, Roman numerals, currency numerators, encircled numbers, and script-specific digits. In many traditional numbering systems, letters are used with a numeric value. Examples include Greek and Hebrew letters as well as Latin letters used in outlines (II.A.1.b). These special cases are not included here as numbers. Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal- numbers. They include script-specific digits, not characters such as Roman numerals (1 + 5 = 15 = fifteen, but I + V = IV = four), subscripts, or super- scripts. Numbers other than decimal digits can be used in numerical expressions, but it is up to the user to determine the specialized uses. The Unicode Standard assigns distinct codes to the forms of digits that are specific to a given script or language. Examples are the digits used with the Arabic script, Chinese num- bers, or those of the Indic languages. For naming conventions, see the introduction to Sec- tion 8.2, Arabic. Table 4-6 gives the numeric values of Unicode characters that can represent numbers. Some CJK ideographs also have numeric values; those are not included in Table 4-6, but are dis- cussed following the table. Table 4-6. Numeric Properties Value Decimal Number Name U+0030 á 0 DIGIT ZERO U+0031 á 1 DIGIT ONE U+0032 á 2 DIGIT TWO U+0033 á 3 DIGIT THREE U+0034 á 4 DIGIT FOUR U+0035 á 5 DIGIT FIVE U+0036 á 6 DIGIT SIX U+0037 á 7 DIGIT SEVEN U+0038 á 8 DIGIT EIGHT U+0039 á 9 DIGIT NINE U+00B2 2 SUPERSCRIPT TWO U+00B3 3 SUPERSCRIPT THREE U+00B9 1 SUPERSCRIPT ONE U+00BC 1/4 VULGAR FRACTION ONE QUARTER U+00BD 1/2 VULGAR FRACTION ONE HALF U+00BE 3/4 VULGAR FRACTION THREE QUARTERS U+0660 á 0 ARABIC-INDIC DIGIT ZERO U+0661 á 1 ARABIC-INDIC DIGIT ONE U+0662 á 2 ARABIC-INDIC DIGIT TWO U+0663 á 3 ARABIC-INDIC DIGIT THREE U+0664 á 4 ARABIC-INDIC DIGIT FOUR U+0665 á 5 ARABIC-INDIC DIGIT FIVE U+0666 á 6 ARABIC-INDIC DIGIT SIX U+0667 á 7 ARABIC-INDIC DIGIT SEVEN U+0668 á 8 ARABIC-INDIC DIGIT EIGHT U+0669 á 9 ARABIC-INDIC DIGIT NINE U+06F0 á 0 EXTENDED ARABIC-INDIC DIGIT ZERO U+06F1 á 1 EXTENDED ARABIC-INDIC DIGIT ONE U+06F2 á 2 EXTENDED ARABIC-INDIC DIGIT TWO U+06F3 á 3 EXTENDED ARABIC-INDIC DIGIT THREE

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 89 4.6 Numeric Value—Normative Character Properties

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+06F4 á 4 EXTENDED ARABIC-INDIC DIGIT FOUR U+06F5 á 5 EXTENDED ARABIC-INDIC DIGIT FIVE U+06F6 á 6 EXTENDED ARABIC-INDIC DIGIT SIX U+06F7 á 7 EXTENDED ARABIC-INDIC DIGIT SEVEN U+06F8 á 8 EXTENDED ARABIC-INDIC DIGIT EIGHT U+06F9 á 9 EXTENDED ARABIC-INDIC DIGIT NINE U+0966 á 0 DEVANAGARI DIGIT ZERO U+0967 á 1 DEVANAGARI DIGIT ONE U+0968 á 2 DEVANAGARI DIGIT TWO U+0969 á 3 DEVANAGARI DIGIT THREE U+096A á 4 DEVANAGARI DIGIT FOUR U+096B á 5 DEVANAGARI DIGIT FIVE U+096C á 6 DEVANAGARI DIGIT SIX U+096D á 7 DEVANAGARI DIGIT SEVEN U+096E á 8 DEVANAGARI DIGIT EIGHT U+096F á 9 DEVANAGARI DIGIT NINE U+09E6 á 0 BENGALI DIGIT ZERO U+09E7 á 1 BENGALI DIGIT ONE U+09E8 á 2 BENGALI DIGIT TWO U+09E9 á 3 BENGALI DIGIT THREE U+09EA á 4 BENGALI DIGIT FOUR U+09EB á 5 BENGALI DIGIT FIVE U+09EC á 6 BENGALI DIGIT SIX U+09ED á 7 BENGALI DIGIT SEVEN U+09EE á 8 BENGALI DIGIT EIGHT U+09EF á 9 BENGALI DIGIT NINE U+09F4 1 BENGALI CURRENCY NUMERATOR ONE U+09F5 2 BENGALI CURRENCY NUMERATOR TWO U+09F6 3 BENGALI CURRENCY NUMERATOR THREE U+09F7 4 BENGALI CURRENCY NUMERATOR FOUR U+09F8 - BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR U+09F9 16 BENGALI CURRENCY DENOMINATOR SIXTEEN U+0A66 á 0 GURMUKHI DIGIT ZERO U+0A67 á 1 GURMUKHI DIGIT ONE U+0A68 á 2 GURMUKHI DIGIT TWO U+0A69 á 3 GURMUKHI DIGIT THREE U+0A6A á 4 GURMUKHI DIGIT FOUR U+0A6B á 5 GURMUKHI DIGIT FIVE U+0A6C á 6 GURMUKHI DIGIT SIX U+0A6D á 7 GURMUKHI DIGIT SEVEN U+0A6E á 8 GURMUKHI DIGIT EIGHT U+0A6F á 9 GURMUKHI DIGIT NINE U+0AE6 á 0 GUJARATI DIGIT ZERO U+0AE7 á 1 GUJARATI DIGIT ONE U+0AE8 á 2 GUJARATI DIGIT TWO U+0AE9 á 3 GUJARATI DIGIT THREE U+0AEA á 4 GUJARATI DIGIT FOUR U+0AEB á 5 GUJARATI DIGIT FIVE U+0AEC á 6 GUJARATI DIGIT SIX U+0AED á 7 GUJARATI DIGIT SEVEN U+0AEE á 8 GUJARATI DIGIT EIGHT U+0AEF á 9 GUJARATI DIGIT NINE U+0B66 á 0 ORIYA DIGIT ZERO U+0B67 á 1 ORIYA DIGIT ONE U+0B68 á 2 ORIYA DIGIT TWO

90 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+0B69 á 3 ORIYA DIGIT THREE U+0B6A á 4 ORIYA DIGIT FOUR U+0B6B á 5 ORIYA DIGIT FIVE U+0B6C á 6 ORIYA DIGIT SIX U+0B6D á 7 ORIYA DIGIT SEVEN U+0B6E á 8 ORIYA DIGIT EIGHT U+0B6F á 9 ORIYA DIGIT NINE U+0BE7 á 1 TAMIL DIGIT ONE U+0BE8 á 2 TAMIL DIGIT TWO U+0BE9 á 3 TAMIL DIGIT THREE U+0BEA á 4 TAMIL DIGIT FOUR U+0BEB á 5 TAMIL DIGIT FIVE U+0BEC á 6 TAMIL DIGIT SIX U+0BED á 7 TAMIL DIGIT SEVEN U+0BEE á 8 TAMIL DIGIT EIGHT U+0BEF á 9 TAMIL DIGIT NINE U+0BF0 10 TAMIL NUMBER TEN U+0BF1 100 TAMIL NUMBER ONE HUNDRED U+0BF2 1,000 TAMIL NUMBER ONE THOUSAND U+0C66 á 0 TELUGU DIGIT ZERO U+0C67 á 1 TELUGU DIGIT ONE U+0C68 á 2 TELUGU DIGIT TWO U+0C69 á 3 TELUGU DIGIT THREE U+0C6A á 4 TELUGU DIGIT FOUR U+0C6B á 5 TELUGU DIGIT FIVE U+0C6C á 6 TELUGU DIGIT SIX U+0C6D á 7 TELUGU DIGIT SEVEN U+0C6E á 8 TELUGU DIGIT EIGHT U+0C6F á 9 TELUGU DIGIT NINE U+0CE6 á 0 KANNADA DIGIT ZERO U+0CE7 á 1 KANNADA DIGIT ONE U+0CE8 á 2 KANNADA DIGIT TWO U+0CE9 á 3 KANNADA DIGIT THREE U+0CEA á 4 KANNADA DIGIT FOUR U+0CEB á 5 KANNADA DIGIT FIVE U+0CEC á 6 KANNADA DIGIT SIX U+0CED á 7 KANNADA DIGIT SEVEN U+0CEE á 8 KANNADA DIGIT EIGHT U+0CEF á 9 KANNADA DIGIT NINE U+0D66 á 0 MALAYALAM DIGIT ZERO U+0D67 á 1 MALAYALAM DIGIT ONE U+0D68 á 2 MALAYALAM DIGIT TWO U+0D69 á 3 MALAYALAM DIGIT THREE U+0D6A á 4 MALAYALAM DIGIT FOUR U+0D6B á 5 MALAYALAM DIGIT FIVE U+0D6C á 6 MALAYALAM DIGIT SIX U+0D6D á 7 MALAYALAM DIGIT SEVEN U+0D6E á 8 MALAYALAM DIGIT EIGHT U+0D6F á 9 MALAYALAM DIGIT NINE U+0E50 á 0 THAI DIGIT ZERO U+0E51 á 1 THAI DIGIT ONE U+0E52 á 2 THAI DIGIT TWO U+0E53 á 3 THAI DIGIT THREE U+0E54 á 4 THAI DIGIT FOUR U+0E55 á 5 THAI DIGIT FIVE U+0E56 á 6 THAI DIGIT SIX

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 91 4.6 Numeric Value—Normative Character Properties

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+0E57 á 7 THAI DIGIT SEVEN U+0E58 á 8 THAI DIGIT EIGHT U+0E59 á 9 THAI DIGIT NINE U+0ED0 á 0 LAO DIGIT ZERO U+0ED1 á 1 LAO DIGIT ONE U+0ED2 á 2 LAO DIGIT TWO U+0ED3 á 3 LAO DIGIT THREE U+0ED4 á 4 LAO DIGIT FOUR U+0ED5 á 5 LAO DIGIT FIVE U+0ED6 á 6 LAO DIGIT SIX U+0ED7 á 7 LAO DIGIT SEVEN U+0ED8 á 8 LAO DIGIT EIGHT U+0ED9 á 9 LAO DIGIT NINE U+0F20 á 0 TIBETAN DIGIT ZERO U+0F21 á 1 TIBETAN DIGIT ONE U+0F22 á 2 TIBETAN DIGIT TWO U+0F23 á 3 TIBETAN DIGIT THREE U+0F2D á 4 TIBETAN DIGIT FOUR U+0F25 á 5 TIBETAN DIGIT FIVE U+0F26 á 6 TIBETAN DIGIT SIX U+0F27 á 7 TIBETAN DIGIT SEVEN U+0F28 á 8 TIBETAN DIGIT EIGHT U+0F29 á 9 TIBETAN DIGIT NINE U+0F2A 1/2 TIBETAN DIGIT HALF ONE U+0F2B 3/2 TIBETAN DIGIT HALF TWO U+0F2C 5/2 TIBETAN DIGIT HALF THREE U+0F2D 7/2 TIBETAN DIGIT HALF FOUR U+0F2E 9/2 TIBETAN DIGIT HALF FIVE U+0F2F 11/2 TIBETAN DIGIT HALF SIX U+0F30 13/2 TIBETAN DIGIT HALF SEVEN U+0F31 15/2 TIBETAN DIGIT HALF EIGHT U+0F32 17/2 TIBETAN DIGIT HALF NINE U+0F33 –1/2 TIBETAN DIGIT HALF ZERO U+1040 á 0 MYANMAR DIGIT ZERO U+1041 á 1 MYANMAR DIGIT ONE U+1042 á 2 MYANMAR DIGIT TWO U+1043 á 3 MYANMAR DIGIT THREE U+1044 á 4 MYANMAR DIGIT FOUR U+1045 á 5 MYANMAR DIGIT FIVE U+1046 á 6 MYANMAR DIGIT SIX U+1047 á 7 MYANMAR DIGIT SEVEN U+1048 á 8 MYANMAR DIGIT EIGHT U+1049 á 9 MYANMAR DIGIT NINE U+1369 á 1 ETHIOPIC DIGIT ONE U+136A á 2 ETHIOPIC DIGIT TWO U+136B á 3 ETHIOPIC DIGIT THREE U+136C á 4 ETHIOPIC DIGIT FOUR U+136D á 5 ETHIOPIC DIGIT FIVE U+136E á 6 ETHIOPIC DIGIT SIX U+136F á 7 ETHIOPIC DIGIT SEVEN U+1370 á 8 ETHIOPIC DIGIT EIGHT U+1371 á 9 ETHIOPIC DIGIT NINE U+1372 10 ETHIOPIC NUMBER TEN U+1373 20 ETHIOPIC NUMBER TWENTY U+1374 30 ETHIOPIC NUMBER THIRTY U+1375 40 ETHIOPIC NUMBER FORTY U+1376 50 ETHIOPIC NUMBER FIFTY

92 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+1377 60 ETHIOPIC NUMBER SIXTY U+1378 70 ETHIOPIC NUMBER SEVENTY U+1379 80 ETHIOPIC NUMBER EIGHTY U+137A 90 ETHIOPIC NUMBER NINETY U+137B 100 ETHIOPIC NUMBER HUNDRED U+137C 10,000 ETHIOPIC NUMBER TEN THOUSAND U+16EE 17 RUNIC ARLAUG SYMBOL U+16EF 18 RUNIC TVIMADUR SYMBOL U+16F0 19 RUNIC BELGTHOR SYMBOL U+17E0 á 0 KHMER DIGIT ZERO U+17E1 á 1 KHMER DIGIT ONE U+17E2 á 2 KHMER DIGIT TWO U+17E3 á 3 KHMER DIGIT THREE U+17E4 á 4 KHMER DIGIT FOUR U+17E5 á 5 KHMER DIGIT FIVE U+17E6 á 6 KHMER DIGIT SIX U+17E7 á 7 KHMER DIGIT SEVEN U+17E8 á 8 KHMER DIGIT EIGHT U+17E9 á 9 KHMER DIGIT NINE U+1810 á 0 MONGOLIAN DIGIT ZERO U+1811 á 1 MONGOLIAN DIGIT ONE U+1812 á 2 MONGOLIAN DIGIT TWO U+1813 á 3 MONGOLIAN DIGIT THREE U+1814 á 4 MONGOLIAN DIGIT FOUR U+1815 á 5 MONGOLIAN DIGIT FIVE U+1816 á 6 MONGOLIAN DIGIT SIX U+1817 á 7 MONGOLIAN DIGIT SEVEN U+1818 á 8 MONGOLIAN DIGIT EIGHT U+1819 á 9 MONGOLIAN DIGIT NINE U+2070 0 SUPERSCRIPT ZERO U+2074 4 SUPERSCRIPT FOUR U+2075 5 SUPERSCRIPT FIVE U+2076 6 SUPERSCRIPT SIX U+2077 7 SUPERSCRIPT SEVEN U+2078 8 SUPERSCRIPT EIGHT U+2079 9 SUPERSCRIPT NINE U+2080 0 SUBSCRIPT ZERO U+2081 1 SUBSCRIPT ONE U+2082 2 SUBSCRIPT TWO U+2083 3 SUBSCRIPT THREE U+2084 4 SUBSCRIPT FOUR U+2085 5 SUBSCRIPT FIVE U+2086 6 SUBSCRIPT SIX U+2087 7 SUBSCRIPT SEVEN U+2088 8 SUBSCRIPT EIGHT U+2089 9 SUBSCRIPT NINE U+2153 1/3 VULGAR FRACTION ONE THIRD U+2154 2/3 VULGAR FRACTION TWO THIRDS U+2155 1/5 VULGAR FRACTION ONE FIFTH U+2156 2/5 VULGAR FRACTION TWO FIFTHS U+2157 3/5 VULGAR FRACTION THREE FIFTHS U+2158 4/5 VULGAR FRACTION FOUR FIFTHS U+2159 1/6 VULGAR FRACTION ONE SIXTH U+215A 5/6 VULGAR FRACTION FIVE SIXTHS U+215B 1/8 VULGAR FRACTION ONE EIGHTH U+215C 3/8 VULGAR FRACTION THREE EIGHTHS U+215D 5/8 VULGAR FRACTION FIVE EIGHTHS U+215E 7/8 VULGAR FRACTION SEVEN EIGHTHS U+215F 1 FRACTION NUMERATOR ONE

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 93 4.6 Numeric Value—Normative Character Properties

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+2160 1 ROMAN NUMERAL ONE U+2161 2 ROMAN NUMERAL TWO U+2162 3 ROMAN NUMERAL THREE U+2163 4 ROMAN NUMERAL FOUR U+2164 5 ROMAN NUMERAL FIVE U+2165 6 ROMAN NUMERAL SIX U+2166 7 ROMAN NUMERAL SEVEN U+2167 8 ROMAN NUMERAL EIGHT U+2168 9 ROMAN NUMERAL NINE U+2169 10 ROMAN NUMERAL TEN U+216A 11 ROMAN NUMERAL ELEVEN U+216B 12 ROMAN NUMERAL TWELVE U+216C 50 ROMAN NUMERAL FIFTY U+216D 100 ROMAN NUMERAL ONE HUNDRED U+216E 500 ROMAN NUMERAL FIVE HUNDRED U+216F 1,000 ROMAN NUMERAL ONE THOUSAND U+2170 1 SMALL ROMAN NUMERAL ONE U+2171 2 SMALL ROMAN NUMERAL TWO U+2172 3 SMALL ROMAN NUMERAL THREE U+2173 4 SMALL ROMAN NUMERAL FOUR U+2174 5 SMALL ROMAN NUMERAL FIVE U+2175 6 SMALL ROMAN NUMERAL SIX U+2176 7 SMALL ROMAN NUMERAL SEVEN U+2177 8 SMALL ROMAN NUMERAL EIGHT U+2178 9 SMALL ROMAN NUMERAL NINE U+2179 10 SMALL ROMAN NUMERAL TEN U+217A 11 SMALL ROMAN NUMERAL ELEVEN U+217B 12 SMALL ROMAN NUMERAL TWELVE U+217C 50 SMALL ROMAN NUMERAL FIFTY U+217D 100 SMALL ROMAN NUMERAL ONE HUNDRED U+217E 500 SMALL ROMAN NUMERAL FIVE HUNDRED U+217F 1,000 SMALL ROMAN NUMERAL ONE THOUSAND U+2180 1,000 ROMAN NUMERAL ONE THOUSAND C D U+2181 5,000 ROMAN NUMERAL FIVE THOUSAND U+2182 10,000 ROMAN NUMERAL TEN THOUSAND U+2183 - ROMAN NUMERAL REVERSED ONE HUNDRED U+2460 1 CIRCLED DIGIT ONE U+2461 2 CIRCLED DIGIT TWO U+2462 3 CIRCLED DIGIT THREE U+2463 4 CIRCLED DIGIT FOUR U+2464 5 CIRCLED DIGIT FIVE U+2465 6 CIRCLED DIGIT SIX U+2466 7 CIRCLED DIGIT SEVEN U+2467 8 CIRCLED DIGIT EIGHT U+2468 9 CIRCLED DIGIT NINE U+2469 10 CIRCLED NUMBER TEN U+246A 11 CIRCLED NUMBER ELEVEN U+246B 12 CIRCLED NUMBER TWELVE U+246C 13 CIRCLED NUMBER THIRTEEN U+246D 14 CIRCLED NUMBER FOURTEEN U+246E 15 CIRCLED NUMBER FIFTEEN U+246F 16 CIRCLED NUMBER SIXTEEN U+2470 17 CIRCLED NUMBER SEVENTEEN U+2471 18 CIRCLED NUMBER EIGHTEEN U+2472 19 CIRCLED NUMBER NINETEEN U+2473 20 CIRCLED NUMBER TWENTY U+2474 1 PARENTHESIZED DIGIT ONE U+2475 2 PARENTHESIZED DIGIT TWO U+2476 3 PARENTHESIZED DIGIT THREE U+2477 4 PARENTHESIZED DIGIT FOUR U+2478 5 PARENTHESIZED DIGIT FIVE

94 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+2479 6 PARENTHESIZED DIGIT SIX U+247A 7 PARENTHESIZED DIGIT SEVEN U+247B 8 PARENTHESIZED DIGIT EIGHT U+247C 9 PARENTHESIZED DIGIT NINE U+247D 10 PARENTHESIZED NUMBER TEN U+247E 11 PARENTHESIZED NUMBER ELEVEN U+247F 12 PARENTHESIZED NUMBER TWELVE U+2480 13 PARENTHESIZED NUMBER THIRTEEN U+2481 14 PARENTHESIZED NUMBER FOURTEEN U+2482 15 PARENTHESIZED NUMBER FIFTEEN U+2483 16 PARENTHESIZED NUMBER SIXTEEN U+2484 17 PARENTHESIZED NUMBER SEVENTEEN U+2485 18 PARENTHESIZED NUMBER EIGHTEEN U+2486 19 PARENTHESIZED NUMBER NINETEEN U+2487 20 PARENTHESIZED NUMBER TWENTY U+2488 1 DIGIT ONE FULL STOP U+2489 2 DIGIT TWO FULL STOP U+248A 3 DIGIT THREE FULL STOP U+248B 4 DIGIT FOUR FULL STOP U+248C 5 DIGIT FIVE FULL STOP U+248D 6 DIGIT SIX FULL STOP U+248E 7 DIGIT SEVEN FULL STOP U+248F 8 DIGIT EIGHT FULL STOP U+2490 9 DIGIT NINE FULL STOP U+2491 10 NUMBER TEN FULL STOP U+2492 11 NUMBER ELEVEN FULL STOP U+2493 12 NUMBER TWELVE FULL STOP U+2494 13 NUMBER THIRTEEN FULL STOP U+2495 14 NUMBER FOURTEEN FULL STOP U+2496 15 NUMBER FIFTEEN FULL STOP U+2497 16 NUMBER SIXTEEN FULL STOP U+2498 17 NUMBER SEVENTEEN FULL STOP U+2499 18 NUMBER EIGHTEEN FULL STOP U+249A 19 NUMBER NINETEEN FULL STOP U+249B 20 NUMBER TWENTY FULL STOP U+24EA 0 CIRCLED DIGIT ZERO U+2776 1 NEGATIVE CIRCLED DIGIT ONE U+2777 2 DINGBAT NEGATIVE CIRCLED DIGIT TWO U+2778 3 DINGBAT NEGATIVE CIRCLED DIGIT THREE U+2779 4 DINGBAT NEGATIVE CIRCLED DIGIT FOUR U+277A 5 DINGBAT NEGATIVE CIRCLED DIGIT FIVE U+277B 6 DINGBAT NEGATIVE CIRCLED DIGIT SIX U+277C 7 DINGBAT NEGATIVE CIRCLED DIGIT SEVEN U+277D 8 DINGBAT NEGATIVE CIRCLED DIGIT EIGHT U+277E 9 DINGBAT NEGATIVE CIRCLED DIGIT NINE U+277F 10 DINGBAT NEGATIVE CIRCLED NUMBER TEN U+2780 1 DINGBAT CIRCLED SANS- DIGIT ONE U+2781 2 DINGBAT CIRCLED SANS-SERIF DIGIT TWO U+2782 3 DINGBAT CIRCLED SANS-SERIF DIGIT THREE U+2783 4 DINGBAT CIRCLED SANS-SERIF DIGIT FOUR U+2784 5 DINGBAT CIRCLED SANS-SERIF DIGIT FIVE U+2785 6 DINGBAT CIRCLED SANS-SERIF DIGIT SIX U+2786 7 DINGBAT CIRCLED SANS-SERIF DIGIT SEVEN U+2787 8 DINGBAT CIRCLED SANS-SERIF DIGIT EIGHT U+2788 9 DINGBAT CIRCLED SANS-SERIF DIGIT NINE U+2789 10 DINGBAT CIRCLED SANS-SERIF NUMBER TEN U+278A 1 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE U+278B 2 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT TWO U+278C 3 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT THREE U+278D 4 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT FOUR

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 95 4.6 Numeric Value—Normative Character Properties

Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+278E 5 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT FIVE U+278F 6 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT SIX U+2790 7 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT SEVEN U+2791 8 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT EIGHT U+2792 9 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT NINE U+2793 10 DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN U+3007 0 IDEOGRAPHIC NUMBER ZERO U+3021 1 HANGZHOU NUMERAL ONE U+3022 2 HANGZHOU NUMERAL TWO U+3023 3 HANGZHOU NUMERAL THREE U+3024 4 HANGZHOU NUMERAL FOUR U+3025 5 HANGZHOU NUMERAL FIVE U+3026 6 HANGZHOU NUMERAL SIX U+3027 7 HANGZHOU NUMERAL SEVEN U+3028 8 HANGZHOU NUMERAL EIGHT U+3029 9 HANGZHOU NUMERAL NINE U+3038 10 HANGZHOU NUMERAL TEN U+3039 20 HANGZHOU NUMERAL TWENTY U+303A 30 HANGZHOU NUMERAL THIRTY U+3192 1 IDEOGRAPHIC ANNOTATION ONE MARK U+3193 2 IDEOGRAPHIC ANNOTATION TWO MARK U+3194 3 IDEOGRAPHIC ANNOTATION THREE MARK U+3195 4 IDEOGRAPHIC ANNOTATION FOUR MARK U+3220 1 PARENTHESIZED IDEOGRAPH ONE U+3221 2 PARENTHESIZED IDEOGRAPH TWO U+3222 3 PARENTHESIZED IDEOGRAPH THREE U+3223 4 PARENTHESIZED IDEOGRAPH FOUR U+3224 5 PARENTHESIZED IDEOGRAPH FIVE U+3225 6 PARENTHESIZED IDEOGRAPH SIX U+3226 7 PARENTHESIZED IDEOGRAPH SEVEN U+3227 8 PARENTHESIZED IDEOGRAPH EIGHT U+3228 9 PARENTHESIZED IDEOGRAPH NINE U+3229 10 PARENTHESIZED IDEOGRAPH TEN U+3280 1 CIRCLED IDEOGRAPH ONE U+3281 2 CIRCLED IDEOGRAPH TWO U+3282 3 CIRCLED IDEOGRAPH THREE U+3283 4 CIRCLED IDEOGRAPH FOUR U+3284 5 CIRCLED IDEOGRAPH FIVE U+3285 6 CIRCLED IDEOGRAPH SIX U+3286 7 CIRCLED IDEOGRAPH SEVEN U+3287 8 CIRCLED IDEOGRAPH EIGHT U+3288 9 CIRCLED IDEOGRAPH NINE U+3289 10 CIRCLED IDEOGRAPH TEN U+FF10 á 0 FULLWIDTH DIGIT ZERO U+FF11 á 1 FULLWIDTH DIGIT ONE U+FF12 á 2 FULLWIDTH DIGIT TWO U+FF13 á 3 FULLWIDTH DIGIT THREE U+FF14 á 4 FULLWIDTH DIGIT FOUR U+FF15 á 5 FULLWIDTH DIGIT FIVE U+FF16 á 6 FULLWIDTH DIGIT SIX U+FF17 á 7 FULLWIDTH DIGIT SEVEN U+FF18 á 8 FULLWIDTH DIGIT EIGHT U+FF19 á 9 FULLWIDTH DIGIT NINE

CJK ideographs from the Unified Repertoire and Ordering also may have numeric values. The primary numeric ideographs are shown in Table 4-7. When used to represent numbers

96 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.7 Mirrored—Normative in decimal notation, zero is represented by U+3007. Otherwise, zero is represented by U+96F6. Table 4-7. Primary Numeric Ideographs U+96F6 0 U+4E00 1 U+4E8C 2 U+4E09 3 U+56DB 4 U+4E94 5 U+516D 6 U+4E03 7 U+516B 8 U+4E5D 9 U+5341 10 U+767E 100 U+5343 1,000 U+4E07 10,000 U+5104 100,000,000 (10,000 × 10,000) U+5146 1,000,000,000,000 (10,000 × 10,000 × 10,000)

Ideographic accounting numbers are commonly used on checks and other financial instru- ments to minimize the possibilities of misinterpretation or fraud in the representation of numerical values. The set of accounting numbers varies somewhat between Japanese, Chi- nese, and Korean usage. Table 4-8 gives a fairly complete listing of the known accounting characters. Some of these characters are ideographs with other meanings pressed into ser- vice as accounting numbers; others are used only as accounting numbers. Table 4-8. Ideographs Used as Accounting Numbers 1 U+58F9, U+58F1, U+5F0Ca 2U+8CAEa, U+8D30a, U+5F10a, U+5F0Da 3 U+53C3, U+53C2, U+53C1a, U+5F0Ea 4 U+8086 5U+4F0D 6 U+9678, U+9646 7 U+67D2b 8 U+634C 9 U+7396 10 U+62FE 100 U+4F70a, U+964C 1,000 U+4EDF 10,000 U+842C

a. These characters are used only as accounting numbers, and have no other meaning. b. In Japan, U+67D2 is also pronounced urusi, meaning “lacquer,” and is treated as a variant of the standard character for “lacquer” U+6F06.

4.7 Mirrored—Normative Mirrored is a normative property of characters such as parentheses, whose images are mir- rored horizontally in text that is laid out from right to left. For example, U+0028   is interpreted as opening parenthesis; in a left-to-right context it will appear as “(”, while in a right-to-left context it will appear as the mirrored glyph “)”. The list of mir- rored characters appears in Table 4-9. Note that mirroring is not limited to paired charac- ters, but that any character with the mirrored property will need two mirrored glyphs. This requirement is necessary to render the character properly in a bidirectional context.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 97 4.7 Mirrored—Normative Character Properties

Table 4-9. Mirrored Characters Value Name U+0028 LEFT PARENTHESIS U+0029 RIGHT PARENTHESIS U+003C LESS-THAN SIGN U+003E GREATER-THAN SIGN U+005B LEFT SQUARE U+005D RIGHT SQUARE BRACKET U+007B LEFT CURLY BRACKET U+007D RIGHT CURLY BRACKET U+2045 LEFT SQUARE BRACKET WITH QUILL U+2046 RIGHT SQUARE BRACKET WITH QUILL U+207D SUPERSCRIPT LEFT PARENTHESIS U+207E SUPERSCRIPT RIGHT PARENTHESIS U+208D SUBSCRIPT LEFT PARENTHESIS U+208E SUBSCRIPT RIGHT PARENTHESIS U+2201 COMPLEMENT U+2202 PARTIAL DIFFERENTIAL U+2203 THERE EXISTS U+2204 THERE DOES NOT EXIST U+2208 ELEMENT OF U+2209 NOT AN ELEMENT OF U+220A SMALL ELEMENT OF U+220B CONTAINS AS MEMBER U+220C DOES NOT CONTAIN AS MEMBER U+220D SMALL CONTAINS AS MEMBER U+2211 N-ARY SUMMATION U+2215 DIVISION SLASH U+2216 SET MINUS U+221A SQUARE ROOT U+221B CUBE ROOT U+221C FOURTH ROOT U+221D PROPORTIONAL TO U+221F RIGHT ANGLE U+2220 ANGLE U+2221 MEASURED ANGLE U+2222 SPHERICAL ANGLE U+2224 DOES NOT DIVIDE U+2226 NOT PARALLEL TO U+222B INTEGRAL U+222C DOUBLE INTEGRAL U+222D TRIPLE INTEGRAL U+222E CONTOUR INTEGRAL U+222F SURFACE INTEGRAL U+2230 VOLUME INTEGRAL U+2231 CLOCKWISE INTEGRAL U+2232 CLOCKWISE CONTOUR INTEGRAL U+2233 ANTICLOCKWISE CONTOUR INTEGRAL U+2239 EXCESS U+223B HOMOTHETIC U+223C TILDE OPERATOR U+223D REVERSED TILDE U+223E INVERTED LAZY S U+223F SINE WAVE U+2240 WREATH PRODUCT U+2241 NOT TILDE U+2242 MINUS TILDE U+2243 ASYMPTOTICALLY EQUAL TO U+2244 NOT ASYMPTOTICALLY EQUAL TO U+2245 APPROXIMATELY EQUAL TO U+2246 APPROXIMATELY BUT NOT ACTUALLY EQUAL TO U+2247 NEITHER APPROXIMATELY NOR ACTUALLY EQUAL TO U+2248 ALMOST EQUAL TO

98 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.7 Mirrored—Normative

Table 4-9. Mirrored Characters (Continued) Value Name U+2249 NOT ALMOST EQUAL TO U+224A ALMOST EQUAL OR EQUAL TO U+224B TRIPLE TILDE U+224C ALL EQUAL TO U+2252 APPROXIMATELY EQUAL TO OR THE IMAGE OF U+2253 IMAGE OF OR APPROXIMATELY EQUAL TO U+2254 COLON EQUALS U+2255 EQUALS COLON U+225F QUESTIONED EQUAL TO U+2260 NOT EQUAL TO U+2262 NOT IDENTICAL TO U+2264 LESS-THAN OR EQUAL TO U+2265 GREATER-THAN OR EQUAL TO U+2266 LESS-THAN OVER EQUAL TO U+2267 GREATER-THAN OVER EQUAL TO U+2268 LESS-THAN BUT NOT EQUAL TO U+2269 GREATER-THAN BUT NOT EQUAL TO U+226A MUCH LESS-THAN U+226B MUCH GREATER-THAN U+226E NOT LESS-THAN U+226F NOT GREATER-THAN U+2270 NEITHER LESS-THAN NOR EQUAL TO U+2271 NEITHER GREATER-THAN NOR EQUAL TO U+2272 LESS-THAN OR EQUIVALENT TO U+2273 GREATER-THAN OR EQUIVALENT TO U+2274 NEITHER LESS-THAN NOR EQUIVALENT TO U+2275 NEITHER GREATER-THAN NOR EQUIVALENT TO U+2276 LESS-THAN OR GREATER-THAN U+2277 GREATER-THAN OR LESS-THAN U+2278 NEITHER LESS-THAN NOR GREATER-THAN U+2279 NEITHER GREATER-THAN NOR LESS-THAN U+227A PRECEDES U+227B SUCCEEDS U+227C PRECEDES OR EQUAL TO U+227D SUCCEEDS OR EQUAL TO U+227E PRECEDES OR EQUIVALENT TO U+227F SUCCEEDS OR EQUIVALENT TO U+2280 DOES NOT PRECEDE U+2281 DOES NOT SUCCEED U+2282 SUBSET OF U+2283 SUPERSET OF U+2284 NOT A SUBSET OF U+2285 NOT A SUPERSET OF U+2286 SUBSET OF OR EQUAL TO U+2287 SUPERSET OF OR EQUAL TO U+2288 NEITHER A SUBSET OF NOR EQUAL TO U+2289 NEITHER A SUPERSET OF NOR EQUAL TO U+228A SUBSET OF WITH NOT EQUAL TO U+228B SUPERSET OF WITH NOT EQUAL TO U+228C MULTISET U+228F SQUARE IMAGE OF U+2290 SQUARE ORIGINAL OF U+2291 SQUARE IMAGE OF OR EQUAL TO U+2292 SQUARE ORIGINAL OF OR EQUAL TO U+2298 CIRCLED DIVISION SLASH U+22A2 RIGHT TACK U+22A3 LEFT TACK U+22A6 ASSERTION U+22A7 MODELS U+22A8 TRUE U+22A9 FORCES

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 99 4.7 Mirrored—Normative Character Properties

Table 4-9. Mirrored Characters (Continued) Value Name U+22AA TRIPLE RIGHT U+22AB DOUBLE VERTICAL BAR DOUBLE RIGHT TURNSTILE U+22AC DOES NOT PROVE U+22AD NOT TRUE U+22AE DOES NOT FORCE U+22AF NEGATED DOUBLE VERTICAL BAR DOUBLE RIGHT TURNSTILE U+22B0 PRECEDES UNDER RELATION U+22B1 SUCCEEDS UNDER RELATION U+22B2 NORMAL SUBGROUP OF U+22B3 CONTAINS AS NORMAL SUBGROUP U+22B4 NORMAL SUBGROUP OF OR EQUAL TO U+22B5 CONTAINS AS NORMAL SUBGROUP OR EQUAL TO U+22B6 ORIGINAL OF U+22B7 IMAGE OF U+22B8 MULTIMAP U+22BE RIGHT ANGLE WITH ARC U+22BF RIGHT TRIANGLE U+22C9 LEFT NORMAL FACTOR SEMIDIRECT PRODUCT U+22CA RIGHT NORMAL FACTOR SEMIDIRECT PRODUCT U+22CB LEFT SEMIDIRECT PRODUCT U+22CC RIGHT SEMIDIRECT PRODUCT U+22CD REVERSED TILDE EQUALS U+22D0 DOUBLE SUBSET U+22D1 DOUBLE SUPERSET U+22D6 LESS-THAN WITH DOT U+22D7 GREATER-THAN WITH DOT U+22D8 VERY MUCH LESS-THAN U+22D9 VERY MUCH GREATER-THAN U+22DA LESS-THAN EQUAL TO OR GREATER-THAN U+22DB GREATER-THAN EQUAL TO OR LESS-THAN U+22DC EQUAL TO OR LESS-THAN U+22DD EQUAL TO OR GREATER-THAN U+22DE EQUAL TO OR PRECEDES U+22DF EQUAL TO OR SUCCEEDS U+22E0 DOES NOT PRECEDE OR EQUAL U+22E1 DOES NOT SUCCEED OR EQUAL U+22E2 NOT SQUARE IMAGE OF OR EQUAL TO U+22E3 NOT SQUARE ORIGINAL OF OR EQUAL TO U+22E4 SQUARE IMAGE OF OR NOT EQUAL TO U+22E5 SQUARE ORIGINAL OF OR NOT EQUAL TO U+22E6 LESS-THAN BUT NOT EQUIVALENT TO U+22E7 GREATER-THAN BUT NOT EQUIVALENT TO U+22E8 PRECEDES BUT NOT EQUIVALENT TO U+22E9 SUCCEEDS BUT NOT EQUIVALENT TO U+22EA NOT NORMAL SUBGROUP OF U+22EB DOES NOT CONTAIN AS NORMAL SUBGROUP U+22EC NOT NORMAL SUBGROUP OF OR EQUAL TO U+22ED DOES NOT CONTAIN AS NORMAL SUBGROUP OR EQUAL U+22F0 UP RIGHT DIAGONAL U+22F1 DOWN RIGHT DIAGONAL ELLIPSIS U+2308 LEFT CEILING U+2309 RIGHT CEILING U+230A LEFT FLOOR U+230B RIGHT FLOOR U+2320 TOP HALF INTEGRAL U+2321 BOTTOM HALF INTEGRAL U+2329 LEFT-POINTING ANGLE BRACKET U+232A RIGHT-POINTING ANGLE BRACKET U+3008 LEFT ANGLE BRACKET U+3009 RIGHT ANGLE BRACKET U+300A LEFT DOUBLE ANGLE BRACKET

100 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.8 Unicode 1.0 Names

Table 4-9. Mirrored Characters (Continued) Value Name U+300B RIGHT DOUBLE ANGLE BRACKET U+300C LEFT CORNER BRACKET U+300D RIGHT CORNER BRACKET U+300E LEFT WHITE CORNER BRACKET U+300F RIGHT WHITE CORNER BRACKET U+3010 LEFT BLACK LENTICULAR BRACKET U+3011 RIGHT BLACK LENTICULAR BRACKET U+3014 LEFT TORTOISE SHELL BRACKET U+3015 RIGHT TORTOISE SHELL BRACKET U+3016 LEFT WHITE LENTICULAR BRACKET U+3017 RIGHT WHITE LENTICULAR BRACKET U+3018 LEFT WHITE TORTOISE SHELL BRACKET U+3019 RIGHT WHITE TORTOISE SHELL BRACKET U+301A LEFT WHITE SQUARE BRACKET U+301B RIGHT WHITE SQUARE BRACKET

4.8 Unicode 1.0 Names The Unicode 1.0 character name is an informative property of the characters defined in Ver- sion 1.0 of the Unicode Standard. The names of Unicode characters were changed in the process of merging the standard with ISO/IEC 10646. The Version 1.0 character names can be obtained from the CD-ROM accompanying the standard or from the ftp site. See also Appendix D, Changes from Unicode Version 2.0. Where the Version 1.0 character name pro- vides additional useful information, it is listed in Chapter 14, Code Charts. For example, U+00B6   has its Version 1.0 name,  , listed for clarity.

4.9 Mathematical Property The mathematical property is an informative property of characters that are used as opera- tors in mathematical formulas. The mathematical property may be useful in algorithms that deal with the display of mathematical text and formulas. However, a number of these characters have multiple usages and may occur with nonmathematical semantics. For example, U+002D - may also be used as a hyphen—and not as a mathemat- ical minus sign. Other characters, including some alphabetic, numeric, punctuation, spaces, arrows, and geometric shapes, are used in mathematical expressions as well, but are even more dependent on the context for their identification. The Unicode characters in the following lists have the mathematical property. Characters with the math property and the Sm General Category: 002B, 003C..003E, 007C, 007E, 00AC, 00B1, 00D7, 00F7, 2044, 207A..207C, 208A..208C, 2190..2194, 219A..219B, 21A0, 21A3, 21A6, 21AE, 21CE..21CF, 21D2, 21D4, 2200..22F1, 2308..230B, 2320..2321, 25B7, 25C1, 266F, FB29, FE62, FE64..FE66, FF0B, FF1C..FF1E, FF5C, FF5E, FFE2, FFE9..FFEC Characters with the math property and other General Category values: 0028..002A, 002D, 002F, 005B..005E, 007B, 007D, 2016, 2032..2034, 207D..207E, 208D..208E, 20D0..20DC, 20E1, 2329..232A, 300A..300B, 301A..301B, FE35..FE38, FE59..FE5C, FE61, FE63, FE68, FF08..FF0A, FF0D, FF0F, FF3B..FF3E, FF5B, FF5D

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 101 4.10 Letters and Other Useful Properties Character Properties

4.10 Letters and Other Useful Properties The CD-ROM that accompanies the Unicode Standard contains data files that list other useful, informative properties of Unicode characters. The full list of those properties can be found in the data files; see, in particular, PropList.txt. This section highlights some of those properties that have a bearing on such implementation issues as parsing of identifiers. (See also Section 5.16, Identifiers.) Computer language standards often characterize identifiers as consisting of letters, sylla- bles, ideographs, and digits, but do not specify exactly what a “letter,” “syllable,” “ideo- graph,” or “digit” is, leaving the definitions implicitly either to a character encoding standard or to a locale specification. The large scope of the Unicode Standard means that it includes many writing systems for which these distinctions are not as self-evident as they may once have been for systems designed to work primarily for Western European lan- guages and Japanese. In particular, while the Unicode Standard includes various “alpha- bets” and “,” it also includes writing systems that fall somewhere in between. As a result, no attempt is made to draw a sharp property distinction between letters and sylla- bles. Letter. This informative property applies to characters that are used to write words. This group includes characters such as capital letters, small letters, ideographs, hangul, and . Combining marks generally assume the letter property of the pre- ceding base character. For example, when searching for word boundaries, combining char- acters don’t break from previous letters. The letter property mappings can be obtained from the CD-ROM accompanying the standard. Alphabetic. The alphabetic property is an informative property of the primary units of alphabets and/or syllabaries, whether combining or noncombining. Included in this group would be composite characters that are canonical equivalents to a combining character sequence of an alphabetic base character plus one or more combining characters; letter digraphs; contextual variants of alphabetic characters; ligatures of alphabetic characters; contextual variants of ligatures; modifier letters; that are compatibility equivalents of single alphabetic letters; and miscellaneous letter elements. Notably, U+00AA    and U+00BA    are simply abbreviatory forms involving a Latin letter and should be considered alphabetic rather than nonalphabetic symbols. Ideographic. The ideographic property is an informative property of the Unified CJK Ideo- graph set (U+4E00..U+9FA5); the CJK Ideograph Extension A set (U+3400..U+4DB5); the CJK Compatibility Ideograph set (U+F900..U+FA2D); U+3007   ; U+3006   ; and the Hangzhou-style numerals (U+3021.. U+3029, U+3038..U+303A).

102 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 5 Implementation Guidelines 5

It is possible to implement a substantial subset of the Unicode Standard as “wide-ASCII” with little change to existing programming practice. However, the Unicode Standard also provides for languages and writing systems that have more complex behavior than English. Whether implementing a new from the ground up or enhancing existing programming environments or applications, it is necessary to examine many aspects of current programming practice and conventions to deal with this more complex behavior. This chapter covers a series of short, self-contained topics that are useful for implementers. The information and examples presented here are meant to help implementers understand and apply the design and features of the Unicode Standard. That is, they are meant to pro- mote good practice in implementations conforming to the Unicode Standard. These recommended guidelines are not normative and are not binding on the imple- menter.

5.1 Transcoding to Other Standards The Unicode Standard exists in a world of other text and character encoding standards— some private, some national, some international. A major strength of the Unicode Stan- dard is the number of other important standards that it incorporates. In many cases, the Unicode Standard included duplicate characters to guarantee round-trip transcoding to established and widely used standards. Conversion of characters between standards is not always a straightforward proposition. Many characters have mixed semantics in one standard and may correspond to more than one character in another. Sometimes standards give duplicate encodings for the same char- acter; at other times the interpretation of a whole set of characters may depend on the application. Finally, there are subtle differences in what a standard may consider a charac- ter.

Issues The Unicode Standard can be used as a pivot to transcode among n different standards. This process, which is sometimes called triangulation, reduces the number of mapping tables an implementation needs from O(n2) to O(n). Generally, tables—as opposed to algorithmic transformation—are required to map between the Unicode Standard and another standard. Table lookup often yields much better performance than even simple algorithmic conversions, as can be implemented between JIS and Shift-JIS.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 105 5.1 Transcoding to Other Standards Implementation Guidelines

Multistage Tables Tables require space. Even small character sets often map to characters from several differ- ent blocks in the Unicode Standard, and thus may contain up to 64K entries in at least one direction. Several techniques exist to reduce the memory space requirements for mapping tables. Such techniques apply not only to transcoding tables, but also to many other tables needed to implement the Unicode Standard, including character property data, collation tables, and glyph selection tables. Flat Tables. If diskspace is not at issue, virtual memory architectures yield acceptable working set sizes even for flat tables because frequency of usage among characters differs widely and even small character sets contain many infrequently used characters. In addi- tion, data intended to be mapped into a given character set generally does not contain char- acters from all blocks of the Unicode Standard (usually, only a few blocks at a time need to be transcoded to a given character set). This situation leaves large sections of the 64K-sized reverse mapping tables (containing the default character, or unmappable character entry) unused—and therefore paged to disk. Ranges. It may be tempting to “optimize” these tables for space by providing elaborate pro- visions for nested ranges or similar devices. This practice leads to unnecessary performance penalties on modern, highly pipelined processor architectures because of branch penalties. A faster solution is to use an optimized two-stage table, which can be coded without any test or branch instructions. Hash tables can also be used for space optimization, although they are not as fast as multistage tables. Two-Stage Tables. Two-stage (high-byte) tables are a commonly employed mechanism to reduce table size (see Figure 5-1). They use an array of 256 pointers and a default value. If a pointer is NULL, the returned value is the default. Otherwise, the pointer references a block of 256 values. Figure 5-1. Two-Stage Tables

Optimized Two-Stage Table. Wherever any blocks are identical, the pointers just point to the same block. For transcoding tables, this case occurs generally for a block containing only mappings to the “default” or “unmappable” character. Instead of using NULL pointers and a default value, one “shared” block of 256 default entries is created. This block is pointed to by all first-stage table entries, for which no character value can be mapped. By

106 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.2 ANSI/ISO C wchar_t avoiding tests and branches, this strategy provides access time that approaches the simple array access, but at a great savings in storage. Given an arbitrary 64K table, it is a simple matter to write a small utility that can calculate the optimal number of stages and their width.

7-Bit or 8-Bit Transmission Some transmission protocols use ASCII control codes for flow control. Others, including some mailers, are notorious for their restrictions to 7-bit ASCII. In these cases, transmissions of Unicode-encoded text must be encapsulated. A number of encapsulation protocols exist today, such as uuencode and BinHex. These protocols can be combined with compression in the same pass, thereby reducing the transmission overhead. The UTF-8 Transformation Format described in Section 2.3, Encoding Forms, and in Section 3.8, Transformations, may be used to transmit Unicode-encoded data through 8-bit transmission paths. The UTF-7 Transformation Format, which is defined in RFC-2152, also exists for use with MIME.

Mapping Table Resources To assist and guide implementers, the Unicode Standard provides a series of mapping tables on the accompanying CD-ROM. Each table consists of one-to-one mappings from the Unicode Standard to another published character standard. The tables include occa- sional multiple mappings. Their primary function is to identify the characters in these standards in the context of the Unicode Standard. In many cases, between the Unicode Standard and other standards will be application-dependent or context- sensitive. Many vendors maintain mapping tables for their own character standards.

Disclaimer The content of all mapping tables has been verified as far as possible by the Uni- code Consortium. However, the Unicode Consortium does not guarantee that the tables are correct in every detail. The mapping tables are provided for informa- tional purposes only. The Unicode Consortium is not responsible for errors that may occur either in the mapping tables printed in this volume or found on the CD-ROM, or in software that implements those tables. All implementers should check the relevant international, national, and vendor standards in cases where ambiguity of interpretation may occur.

5.2 ANSI/ISO C wchar_t With the wchar_t type, ANSI/ISO C provides for inclusion of fixed- width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension. The Unicode characters in the ASCII range U+0020 to U+007E satisfy these conditions. Thus, if an implementation uses ASCII to code the portable C execution set, the use of the Unicode character set for the wchar_t type, with a width of 16 bits, fulfills the requirement.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 107 5.3 Unknown and Missing Characters Implementation Guidelines

The width of wchar_t is compiler-specific and can be as little as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers. However, programmers can use a macro or typedef (for example, UNICHAR) that can be compiled as unsigned short or wchar_t depending on the target compiler and platform. This choice enables correct compilation on different platforms and compilers. Where a 16-bit implementation of wchar_t is guaranteed, such macros or typedefs may be predefined (for example, TCHAR on the Win32 API). On systems where the native character type or wchar_t is implemented as a 32-bit quan- tity, an implementation may transiently use 32-bit quantities to represent Unicode charac- ters during processing. The internal workings of this representation are treated as a black box and are not Unicode-conformant. In particular, any API or runtime library interfaces that accept strings of 32-bit characters are not Unicode-conformant. If such an implemen- tation interchanges 16-bit Unicode characters with the outside world, then this interchange can be conformant as long as the interface for this interchange complies with the require- ments of Chapter 3, Conformance.

5.3 Unknown and Missing Characters This section briefly discusses how users or implementers might deal with characters that are not supported, or that, although supported, are unavailable for legible rendering.

Unassigned and Private Use Character Codes There are two classes of character code values that even a “complete” implementation of the Unicode Standard cannot necessarily interpret correctly: • Character code values that are unassigned • Character code values in the Private Use Area for which no private agreement exists An implementation should not attempt to interpret such code values. Options for render- ing such unknown code values include printing the character code value as four hexadeci- mal digits, printing a black or white box, using appropriate glyphs such as ƒ for unassigned and @ for private use, or simply displaying nothing. In no case should an implementation assume anything else about the character’s properties, nor should it blindly delete such characters. It should not unintentionally transform them into some- thing else.

Interpretable but Unrenderable Characters An implementation may receive a character that is an assigned character in the Unicode character encoding, but be unable to render it because it does not have a font for it or is otherwise incapable of rendering it appropriately. In this case, an implementation might be able to provide further limited feedback to the user’s queries such as being able to sort the data properly, show its script, or otherwise dis- play it in a default manner. An implementation can distinguish between unrenderable (but assigned) characters and unassigned code values by printing the former with distinctive glyphs that give some general indication of their type, such as , , , , , , , , , , , and so on.

108 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.4 Handling Surrogate Pairs

Reassigned Characters Some character code values, primarily Hangul, were assigned in versions of the Unicode Standard earlier than 2.0, but have been reassigned because the characters were moved (see the Transcoding Tables on the CD-ROM). Such code values should be recognized and con- verted into the correct Version 3.0 character code values where possible. In some cases, a Version 3.0 application may still need to emit Version 1.1 character codes to communicate with some Version 1.1 applications. This issue is limited to particular early Unicode appli- cations and data sets and is not a general problem. The Unicode Consortium has the policy to never reassign characters in the future.

5.4 Handling Surrogate Pairs Surrogate pairs provide a mechanism for encoding 917,476 characters without requiring the use of 32-bit characters. Because predominantly infrequently used characters will be assigned to surrogate pairs, not all implementations need to handle these pairs initially. It is widely expected that surrogate pairs will be assigned in the not too distant future. High-surrogates and low-surrogates are assigned to disjoint ranges of code positions. Non- surrogate characters can never be assigned to these ranges. Because the high- and low- surrogate ranges are disjoint, determining character boundaries requires at most scanning one preceding or following Unicode code value, without regard to any other context. This approach enables efficient random access, which is not possible with encodings such as Shift-JIS. In well-formed text, a low-surrogate can be preceded only by a high-surrogate and not by a low-surrogate or nonsurrogate. A high-surrogate can be followed only by a low-surrogate and not by a high-surrogate or nonsurrogate. Surrogates are also designed to work well with implementations that do not recognize them. For example, the valid sequence of Unicode characters [0048] [0069] [0020] [D800] [DC00] [0021] [0021] would be interpreted by a Version 1.0-conformant implementation as “ !!” This outcome is only slightly worse than that pro- duced by a Version 3.0-conformant implementation that did not support that particular surrogate pair, and so interpreted the sequence as “Hi !!” As long as an implementation does not remove either surrogate or insert another character between them, the data integrity is maintained. Moreover, even if the data become cor- rupted, the data corruption is localized, unlike with some multibyte encodings such as Shift-JIS or EUC. Corrupting a single Unicode value affects only a single character. Because the high- and low-surrogates are disjoint and always occur in pairs, errors are prevented from propagating through the rest of the text. Implementations can have different levels of support for surrogates, based on two primary issues: • Does the implementation interpret a surrogate pair as the assigned single char- acter? • Does the implementation guarantee the integrity of a surrogate pair?

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 109 5.5 Handling Numbers Implementation Guidelines

The decisions on these issues give rise to three reasonable levels of support for surrogates as shown in Table 5-1. Table 5-1. Surrogate Support Levels

Support Level Interpretation Integrity of Pairs None No pairs Does not guarantee Weak Non-null subset of pairs Does not guarantee Strong Non-null subset of pairs Guarantees

Example. The following sentence could be displayed in three different ways, assuming that both the weak and strong implementations have Phoenician fonts but no hieroglyphics: “The Greek letter α corresponds to and to .” The in Table 5-2 represents any visual represen- tation of an uninterpretable single character by the implementation. Table 5-2. Surrogate Level Examples None “The Greek letter α corresponds to and to .” Weak “The Greek letter α corresponds to and to .” Strong “The Greek letter α corresponds to and to .”

Many implementations that handle advanced features of the Unicode Standard can easily be modified to support a weak surrogate implementation. For example, • Text collation can be handled by treating those surrogate pairs as “grouped characters,” much as “” in Dutch or “ll” in traditional Spanish. • Text entry can be handled by having a keyboard generate two Unicode values with a single keypress, much as an can have a “lam-alef ” key that generates a sequence of two characters, lam and alef. • Character display and measurement can be handled by treating specific surro- gate pairs as ligatures, in the same way as “f ” and “i” are joined to form the sin- gle glyph “€”. • Truncation can be handled with the same mechanism as used to keep combin- ing marks with base characters. (For more information, see Section 5.15, Locat- ing Text Element Boundaries.) Users are prevented from damaging the text if a text editor keeps insertion points (also known as ) on character boundaries. As with text-element boundaries, the lowest- level string-handling routines (such as wcschr) do not necessarily need to be modified to prevent surrogates from being damaged. In practice, it is sufficient that only certain higher- level processes (such as those just noted) be aware of surrogate pairs; the lowest-level rou- tines can continue to function on sequences of 16-bit Unicode code values without having to treat surrogates specially.

5.5 Handling Numbers There are many sets of characters that represent decimal digits in different scripts. Systems that interpret those characters numerically should provide the correct numerical values. For example, the sequence U+0968   , U+0966    when numerically interpreted has the value twenty.

110 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.6 Handling Properties

When converting binary numerical values to a visual form, digits can be chosen from dif- ferent scripts. For example, the value twenty can be represented either by U+0032  , U+0030   or by U+0968   , U+0966   , or by U+0662 -  , U+0660 -  . It is recommended that systems allow users to choose the format of the resulting digits by replacing the appropriate occurrence of U+0030   with U+0660 -  , and so on. (See Chapter 4, Character Properties, for tables providing the infor- mation needed to implement formatting and scanning numerical values.) Fullwidth variants of the ASCII digits are simply compatibility variants of regular digits and should be treated as regular Western digits. The Roman numerals and East Asian ideographic numerals are decimal numeral writing systems, but they are not formally decimal radix digit systems. That is, it is not possible to do a one-to-one transcoding to forms such as 123456.789. Both of them are appropriate only for positive integer writing. It is also possible to write numbers in two ways with ideographic digits. For example, Figure 5-2 shows how the number 1,234 can be written. Figure 5-2. Ideographic Numbers

or

Supporting these digits for numerical parsing means that implementations must be smart about distinguishing between these two cases. Digits often occur in situations where they need to be parsed, but are not part of numbers. One such example is alphanumeric identifiers (see Section 5.16, Identifiers). It is only at a second level (for example, when implementing a full mathematical formula parser) that considerations such as superscripting become crucial for interpretation.

5.6 Handling Properties The Unicode Standard provides detailed information on character properties (see Chapter 4, Character Properties, and the Unicode Character Database on the accompanying CD- ROM). These properties can be used by implementers to implement a variety of low-level processes. Fully language-aware and higher-level processes will need additional informa- tion. A two-stage table, as described in Section 5.1, Transcoding to Other Standards, can also be used to handle mapping to character properties or other information indexed by character code. For example, the data from the Unicode Character Database on the accompanying CD-ROM can be represented in memory very efficiently as a set of two-stage tables. Individual properties are common to large sets of characters and therefore lend themselves to implementations using the shared blocks. Many popular implementations are influenced by the POSIX model, which provides func- tions for separate properties, such as isalpha, isdigit, and so on. Implementers of Unicode-based systems and internationalization libraries need to take care to extend these concepts to the full set of Unicode characters correctly.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 111 5.7 Normalization Implementation Guidelines

In Unicode-encoded text, combining characters participate fully. In addition to providing callers with information about which characters have the combining property, implement- ers and writers of language standards need to provide for the fact that combining charac- ters assume the property of the preceding base character (see also Section 3.5, Combination, and Section 5.16, Identifiers). Other important properties, such as sort weights, may also depend on a character’s context. Because the Unicode Standard provides such a rich set of properties, implementers will find it useful to allow access to several properties at a time, possibly returning a string of bit-fields, one bit-field per character in the input string. In the past, many existing standards, such as the C language standard, assumed very mini- malist “portable character sets” and geared their functions to operations on such sets. As the Unicode encoding itself is increasingly becoming the portable character set, imple- menters are advised to distinguish between historical limitations and true requirements when implementing specifications for particular text processes.

5.7 Normalization Alternative Spellings. The Unicode Standard contains explicit codes for the most fre- quently used accented characters. These characters can also be composed; in the case of accented letters, characters can be composed from a base character and nonspacing mark(s). The Unicode Standard provides a table of normative spellings (decompositions) of charac- ters that can be composed using a base character plus one or more nonspacing marks. These tables can be used to unify spelling in a standard manner in accordance with Section 3.10, Canonical Ordering Behavior. Implementations that are “liberal” in what they accept, but “conservative” in what they issue, will have the fewest compatibility problems. • The decomposition mappings are specific to a particular version of the Unicode Standard. Changes may occur as the result of character additions in the future. When new precomposed characters are added, mappings between those char- acters and corresponding composed character sequences will be added. Simi- larly, if a new combining mark is added to this standard, it may allow decompositions for precomposed characters that did not have decompositions before. Normalization. Systems may normalize Unicode-encoded text to one particular sequence, such as normalizing composite character sequences into precomposed characters, or vice versa (see Figure 5-3). Figure 5-3. Normalization Unnormalized a ¨ · ë ˜ ò

˜ ä· ëò a ¨¨· e ˜ o ` Precomposed Decomposed

112 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.8 Compression

Compared to the number of possible combinations, only a relatively small number of pre- composed base character plus nonspacing marks have independent Unicode character val- ues; most existed in dominant standards. Systems that cannot handle nonspacing marks can normalize to precomposed characters; this option can accommodate most modern Latin-based languages. Such systems can use fallback rendering techniques to at least visually indicate combinations that they cannot handle (see the “Fallback Rendering” subsection of Section 5.14, Rendering Nonspacing Marks). In systems that can handle nonspacing marks, it may be useful to normalize so as to elimi- nate precomposed characters. This approach allows such systems to have a homogeneous representation of composed characters and maintain a consistent treatment of such char- acters. However, in most cases, it does not require too much extra work to support mixed forms, which is the simpler route. The standard forms for normalization and for conformance to those forms are defined in Unicode Technical Report #15, “Unicode Normalization Forms,” on the CD-ROM or the up-to-date version on the Unicode Web site. For further information see Chapter 3, Con- formance; Chapter 4, Character Properties; and Section 2.6, Combining Characters.

5.8 Compression Using the Unicode character encoding may increase the amount of storage or memory space dedicated to the text portion of files. Compressing Unicode-encoded files or strings can therefore be an attractive option. Compression always constitutes a higher-level protocol and makes interchange dependent on knowledge of the compression method employed. For a detailed discussion on compression and a standard compression scheme for Unicode, see Unicode Technical Report #6, “A Standard Compression Scheme for Uni- code,” on the CD-ROM or the up-to-date version on the Unicode Web site. Encoding forms defined in Section 2.3, Encoding Forms, have different storage characteris- tics. For example, as long as text contains only characters from the Basic Latin (ASCII) block, it occupies the same amount of space whether it is encoded with the UTF-8 transfor- mation format or with ASCII codes. On the other hand, text consisting of ideographs encoded with UTF-8 will require more space than equivalent Unicode-encoded text.

5.9 Line Handling are represented on different platforms by CR, LF, NL, or CRLF. Not only are new- lines represented by different characters on different platforms, but they also have ambigu- ous behavior even on the same platform. With the advent of the Web, where text on a single machine can arise from many sources, this inconsistency causes a significant problem. Unfortunately, these characters are often transcoded directly into the corresponding Uni- code code values when a character set is transcoded. For this reason, even programs han- dling pure Unicode text must deal with the problems. For detailed guidelines, see Unicode Technical Report #13, “Unicode Newline Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 113 5.10 Regular Expressions Implementation Guidelines

5.10 Regular Expressions Byte-oriented regular expression engines require extensions to successfully handle Uni- code. The following issues are involved in such extensions: • Unicode is a large character set—regular expression engines that are adapted to handle only small character sets may not scale well. • Unicode encompasses a wide variety of languages that can have very different characteristics than English or other Western European text. For detailed information on the requirements of Unicode Regular Expressions, see Unicode Technical Report #18, “Unicode Regular Expression Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.

5.11 Language Information in Plain Text

Requirements for Language Tagging The requirement for language information embedded in plain text data is often overstated. Many commonplace operations such as collation seldom require this extra information. In collation, for example, foreign language text is generally collated as if it were not in a for- eign language. (See Unicode Technical Report #10, “Unicode Collation Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site for more information.) For example, an index in an English book would not sort the Spanish word “churo” after “czar,” where it would be collated in traditional Spanish, nor would an English atlas put the Swedish city of Östersjö after Zanzibar, where it would appear in Swedish. However, language information is very useful in certain operations, such as spell-checking or hyphenating a mixed-language document. It is also useful in choosing the default font for a run of unstyled text; for example, the ellipsis character may have a very different appearance in Japanese fonts than in European fonts. Although language information is useful in performing text-to-speech operations, modern software for doing acceptable text- to-speech must be so sophisticated in performing grammatical analysis of text that the extra work in determining the language is not significant. Language information can be presented as out-of-band information or inline tags. In inter- nal implementations, it is quite common to use out-of-band information, which is stored in data structures that are parallel to the text, rather than embedded in it. Out-of-band information does not interfere with the normal processing of the text (comparison, search- ing, and so on) and more easily supports manipulation of the text. For interchange purposes, it is becoming common to use tagged information, which is embedded in the text. Unicode Technical Report #7, “Plane 14 Characters for Language Tags,” which is found on the CD-ROM or in its up-to-date version on the Unicode Web site, provides a proposed mechanism for representing language tags. Like most tagging mechanisms, these language tags are stateful: a start tag establishes an attribute for the text, and an end tag concludes it.

Working with Language Tags Avoiding Language Tags. Because of the extra implementation burden, language tags should be avoided in plain text unless language information is required and it is known that the receivers of the text will properly recognize and maintain the tags. However, where

114 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.11 Language Information in Plain Text language tags must be used, implementers should consider the following implementation issues involved in supporting language information with tags and decide how to handle tags where they are not fully supported. This discussion applies to any mechanism for pro- viding language tags in a plain text environment. Display. Language tags themselves are not displayed. This choice may not require modifi- cation of the displaying program, if the fonts on that platform have the language tag char- acters mapped to zero-width, invisible glyphs. Language tags may need to be taken into account for special processing, such as hyphenation or choice of font. Processing. Sequential access to the text is generally straightforward. If language codes are not relevant to the particular processing operation, then they should be ignored. Random access to stateful tags is more problematic. Because the current state of the text depends upon tags previous to it, the text must be searched backward, sometimes all the way to the start. With these exceptions, tags pose no particular difficulties as long as no modifications are made to the text. Editing and Modification. Inline tags present particular problems for text changes, because they are stateful. Any modifications of the text are more complicated, as those modifications need to be aware of the current language status and the ... tags must be properly maintained. If an editing program is unaware that certain tags are stateful and cannot process them correctly, then it is very easy for the user to modify text in ways that corrupt it. For example, a user might delete part of a tag or paste text including a tag into the wrong context. Dangers of Incomplete Support. Even programs that do not interpret the tags should not allow editing operations to break initial tags or leave tags unpaired. Unpaired tags should be discarded upon a save or send operation. Nonetheless, malformed text may be produced and transmitted by a tag-unaware editor. Therefore, implementations that do not ignore language tags must be prepared to receive malformed tags. On reception of a malformed or unpaired tag, language tag-aware imple- mentations should reset the language to NONE, and then ignore the tag. Higher-Level Protocols. Higher-level protocols such as HTML or MIME may also supply language tags. Language tags should be avoided wherever higher-level protocols, such as a rich-text format, provide language information. Not only does this approach avoid prob- lems, but it also avoids cases where the higher-level protocol and the language tags disagree.

Language Tags and Han Unification Han Unification. A common misunderstanding about Unicode Han Unification is the mistaken belief that Han characters cannot be rendered properly without language infor- mation. This idea might lead an implementer to conclude that language information must always be added to plain text using the tags. However, this implication is incorrect. The goal and methods of Han Unification were to ensure that the text remained legible. Although font, size, width, and other format specifications need to be added to produce precisely the same appearance on the source and target machines, plain text remains legible in the absence of these specifications. There should never be any confusion in Unicode, because the distinctions between the uni- fied characters are all within the range of stylistic variations that exist in each country. No unification in Unicode should make it impossible for a reader to identify a character if it appears in a different font. Where precise font information is important, it is best conveyed in a rich-text format. Typical Scenarios. The following e-mail scenarios illustrate that the need for language information with Han characters is often overstated:

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 115 5.12 Editing and Selection Implementation Guidelines

• Scenario 1. A Japanese user sends out untagged Japanese text. Readers are Japa- nese (with Japanese fonts). Readers see no differences from what they expect. • Scenario 2. A Japanese user sends out an untagged mixture of Japanese and Chinese text. Readers are Japanese (with Japanese fonts) and Chinese (with Chinese fonts). Readers see the mixed text with only one font, but the text is still legible. Readers recognize the difference between the languages by the con- tent. • Scenario 3. A Japanese user sends out a mixture of Japanese and Chinese text. Text is marked with font, size, width, and so on because the exact format is important. Readers have the fonts and other display support. Readers see the mixed text with different fonts for different languages. They recognize the dif- ference between the languages by the content, and see the text with glyphs that are more typical for the particular language. It is common even in printed matter to render passages of foreign language text in native- language fonts, just for familiarity. For example, Chinese text in a Japanese document can commonly be rendered in a Japanese font.

5.12 Editing and Selection

Consistent Text Elements A user interface for editing is most intuitive when the text elements are consistent (see Figure 5-4). In particular, the editing actions of deletion, selection, mouse-clicking, and cursor-key movement should act as though they have a consistent set of boundaries. For example, hitting a leftward-arrow should result in the same cursor location as delete. This synchronization gives a consistent, single model for editing characters. Figure 5-4. Consistent Character Boundaries

Cluster ·æÕÐü Rôle s* Stack ·æÕÐü Rôle s* Atomic ·æÕÐü Rôle s*

Three types of boundaries are generally useful in editing and selecting within words. Cluster Boundaries. Cluster boundaries occur in scripts such as Devanagari. Selection or deletion using cluster boundaries means that an entire cluster (such as ka + vowel sign a) or a composed character (o + circumflex) is selected or deleted as a single unit. Stacked Boundaries. Stacked boundaries are generally somewhat finer than cluster bound- aries. Free-standing elements (such as vowel sign a) can be independently selected and deleted, but any elements that “stack” (such as o + circumflex, or vertical ligatures such as Arabic lam + meem) can be selected only as a single unit. Stacked boundaries treat all com- posed character sequences as single entities, much like precomposed characters.

116 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.13 Strategies for Handling Nonspacing Marks

Atomic Character Boundaries. The use of atomic character boundaries is closest to selec- tion of individual Unicode characters. However, most modern systems indicate selection with some sort of rectangular highlighting. This approach places restrictions on the consis- tency of editing because some sequences of characters do not linearly progress from the start of the line. When characters stack, two mechanisms are used to visually indicate par- tial selection: linear and nonlinear boundaries. Linear Boundaries. Use of linear boundaries treats the entire width of the resultant glyph as belonging to the first character of the sequence, and the remaining characters in the backing-store representation as having no width and being visually afterward. This option is the simplest mechanism and one that is currently in use on the Macintosh and some other systems. The advantage of this system is that it requires very little addi- tional implementation work. The disadvantage is that it is never easy to select narrow char- acters, let alone a zero-width character. Mechanically, it requires the user to select just to the right of the nonspacing mark and drag just to the left. It also does not allow the selec- tion of individual nonspacing marks if more than one are present. Nonlinear Boundaries. Use of linear boundaries divides any stacked element into parts. For example, picking a point halfway across a lam + meem ligature can represent the divi- sion between the characters. One can either allow highlighting with multiple rectangles or use another method such as coloring the individual characters. Notice that with more work, a precomposed character can behave in deletion as if it were a composed character sequence with atomic character boundaries. This procedure involves deriving the character’s decomposition on the fly to get the components to be used in sim- ulation. For example, deletion occurs by decomposing, removing the last character, then recomposing (if more than one character remains). However, this technique does not work in general editing and selection. In most systems, the character is the smallest addressable item in text, so the selection and assignment of properties (such as font, color, letterspacing, and so on) are done on a per- character basis. There is no good way to simulate this addressability with precomposed characters. Systematically modifying all text editing to address parts of characters would be quite inefficient. Just as there is no single notion of text element, so there is no single notion of editing char- acter boundaries. At different times, users may want different degrees of granularity in the editing process. Two methods suggest themselves. First, the user may set a global preference for the character boundaries. Second, the user may have alternative command mecha- nisms, such as Shift-Delete, which give more (or less) fine control than the default mode.

5.13 Strategies for Handling Nonspacing Marks By following these guidelines, a programmer should be able to implement systems and routines that provide for the effective and efficient use of nonspacing marks in a wide variety of applications and systems. The programmer also has the choice between minimal techniques that apply to the vast majority of existing systems and more sophisticated tech- niques that apply to more demanding situations, such as higher-end DTP (desktop pub- lishing). In this section and the following section, the terms nonspacing mark and combining charac- ter are used interchangeably. The terms diacritic, accent, stress mark, Hebrew point, Arabic vowel, and others are sometimes used instead of nonspacing mark. (They refer to particular types of nonspacing marks.)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 117 5.13 Strategies for Handling Nonspacing Marks Implementation Guidelines

A relatively small number of implementation features are needed to support nonspacing marks. Different possible levels of implementation are also possible. A minimal system yields good results and is relatively simple to implement. Most of the features required by such a system are modifications of existing software. As nonspacing marks are required for a number of languages such as Arabic, Hebrew, and the languages of the Indian subcontinent, many vendors already have systems capable of dealing with these characters and can use their experience to produce general-purpose soft- ware for handling these characters in the Unicode Standard. Rendering. A fixed set of composite character sequences can be rendered effectively by means of fairly simple substitution. Wherever a sequence of base character plus one or more nonspacing combining marks occurs, a glyph representing the combined form can be substituted. In simple, monospaced character rendering, a nonspacing combining mark has a zero advance width, and a composite character sequence will have the same width as the base character. When truncating strings, it is always easiest to truncate starting from the end and working backward. A trailing nonspacing mark will then not be separated from the preceding base character. A more sophisticated rendering system can take into account more subtle variations in widths and kerning with nonspacing marks or account for those cases where the composite character sequence has a different advance width than the base character. Such rendering systems are not necessary for the large majority of applications. They can, however, also supply more sophisticated truncation routines. (See also Section 5.14, Rendering Nonspac- ing Marks.) Other Processes. Correct multilingual comparison routines must already be able to com- pare a sequence of characters as one character, or one character as if it were a sequence. Such routines can also handle composite character sequences when supplied the appropri- ate data. When searching strings, remember to check for additional nonspacing marks in the target string that may affect the interpretation of the last matching character. Line-break algorithms generally use state machines for determining word breaks. Such algorithms can be easily adapted to prevent separation of nonspacing marks from base characters. (See also the discussion in Section 5.17, Sorting and Searching; Section 5.7, Nor- malization; and Section 5.15, Locating Text Element Boundaries.)

Keyboard Input A common implementation for the input of composed character sequences is the use of so- called dead keys. These keys match the mechanics used by to generate such sequences through overtyping the base character after the nonspacing mark. In computer implementations, keyboards enter a special state when a is pressed for the accent and emit a precomposed character only when one of a limited number of “legal” base char- acters is entered. It is straightforward to adapt such a system to emit composed character sequences or precomposed characters as needed. Although typists, especially in the Latin script, are trained on systems working in this way, many scripts in the Unicode Standard (including the Latin script) may be implemented according to the handwriting sequence, in which users type the base character first, followed by the accents or other nonspacing marks (see Figure 5-5). In the case of handwriting sequence, each keystroke produces a distinct, natural change on the screen; there are no hidden states. To add an accent to any existing character, the user positions the insertion point () after the character and types the accent.

118 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.13 Strategies for Handling Nonspacing Marks

Figure 5-5. Dead Keys Versus Handwriting Sequence Dead Key Handwriting Zrich Zrich ¨ u Zrich Zurich u ¨ Zürich Zürich

Truncation There are two types of truncation: truncation by character count and truncation by dis- played width. Truncation by character count can entail loss (be lossy) or be lossless. Truncation by character count is used where, due to storage restrictions, a limited number of characters can be entered into a field; it is also used where text is broken into buffers for transmission and other purposes. The latter case can be lossless if buffers are recombined seamlessly before processing or if lookahead is performing for possible combining charac- ter sequences straddling buffers. When fitting data into a field of limited length, some information will be lost. Truncating at a text element boundary (for example, on the last composite character sequence boundary or even last word boundary) is often preferable to truncating after the last code value (see Figure 5-6). (See Section 5.15, Locating Text Element Boundaries.) Figure 5-6. Truncating Composed Character Sequences On CCS Jose« boundaries

Clipping JosŽ

Ellipsis Jo...

Truncation by displayed width is used for visual display in a narrow field. In this case, trun- cation occurs on the basis of the width of the resulting string rather than on the basis of a character count. In simple systems, it is easiest to truncate by width, starting from the end and working backward by subtracting character widths as one goes. Because a trailing non- spacing mark does not contribute to the measurement of the string, the result will not sep- arate nonspacing marks from their base characters. If the textual environment is more sophisticated, the widths of characters may depend on their context, due to effects such as kerning, ligatures, or contextual formation. For such systems, the width of a composed character, such as an ï, may be different than the width of a narrow base character alone. To handle these cases, a final check should be made on any truncation result derived from successive subtractions.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 119 5.14 Rendering Nonspacing Marks Implementation Guidelines

A different option is simply to clip the characters graphically. However, the result may look ugly. Also, if the clipping occurs between characters, it may not give any visual feedback that characters are being omitted. A graphic or ellipsis can be used to give this visual feed- back.

5.14 Rendering Nonspacing Marks This discussion assumes the use of proportional fonts, where the widths of individual char- acters can vary. Various techniques can be used with monospaced fonts, but in general, it is possible to get only a semblance of a correct rendering in these fonts, especially with inter- national characters. When rendering a sequence consisting of more than one nonspacing mark, the nonspacing marks should, by default, be stacked outward from the base character. That is, if two non- spacing marks appear over a base character, then the first nonspacing mark should appear on top of the base character, and the second nonspacing mark should appear on top of the first. If two nonspacing marks appear under a base character, then the first nonspacing mark should appear beneath the base character, and the second nonspacing mark should appear below the first (see Section 2.6, Combining Characters). This default treatment of multiple, potentially interacting nonspacing marks is known as the inside-out rule (see Figure 5-7). Figure 5-7. Inside-Out Rule Characters Glyphs a @¬÷@ @. @ ä÷. ö ö This default behavior may be altered based on typographic preferences or on knowledge of the specific orthographic treatment to be given to multiple nonspacing marks in the con- text of a particular . For example, in the modern Vietnamese writing system, an acute or grave accent (serving as a tone mark) may be positioned slightly to one side of a circumflex accent rather than directly above it. If the text to be displayed is known to employ a different typographic convention (either implicitly through knowledge of the language of the text or explicitly through rich-text style bindings), then an alternative posi- tioning may be given to multiple nonspacing marks instead of that specified by the default inside-out rule. Fallback Rendering. Several methods are available to deal with an unknown composed character sequence that is outside of a fixed, renderable set (see Figure 5-8). One method (Show Hidden) indicates the inability to draw the sequence by drawing the base character first and then rendering the nonspacing mark as an individual unit—with the nonspacing mark positioned on a dotted circle. (This convention is used in Chapter 14, Code Charts.) Figure 5-8. Fallback Rendering ˆ @ @ Ggˆ Gˆˆg xGˆ xgˆ x “Ideal”“Show “Simple Hidden” Overlap”

120 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.14 Rendering Nonspacing Marks

Another method (Simple Overlap) uses default fixed positioning for an overlapping zero- width nonspacing mark, generally placed far away from possible base characters. For exam- ple, the default positioning of a circumflex can be above the ascent, which will place it above capital letters. Even though the result will not be particularly attractive for letters such as g-circumflex, the result should generally be recognizable in the case of single non- spacing marks. In a degenerate case, a nonspacing mark occurs as the first character in the text or is sepa- rated from its base character by a line separator, paragraph separator, or other formatting character that causes a positional separation. This result is called a defective combining character sequence (see Section 3.5, Combination). Defective combining character sequences should be rendered as if they had a space as a base character. Bidirectional Positioning. In bidirectional text, the nonspacing marks are reordered with their base characters; that is, they visually apply to the same base character after the algo- rithm is used (see Figure 5-9). There are a few ways to accomplish this positioning. Figure 5-9. Bidirectional Placement Backing Store g @ö U @V Screen Order g @ö @V U Glyph Metrics ö V xxg x x x x xUx

Aligned Glyphs ö xgx xUV x

The simplest method is similar to the Simple Overlap fallback method. In the bidirectional algorithm, combining marks take the level of their base character. In that case, Arabic and Hebrew nonspacing marks would come to the left of their base characters. The font is designed so that instead of overlapping to the left, the Arabic and Hebrew nonspacing marks overlap to the right. In Figure 5-9, the “glyph metrics” line shows the pen start and end for each glyph with such a design. After aligning the start and end points, the final result shows each nonspacing mark attached to the corresponding base letter. More sophis- ticated rendering could then apply the positioning methods outlined in the next section. With some rendering software, it may be necessary to keep the nonspacing mark glyphs consistently ordered to the right of the base character glyphs. In that case, a second pass can be done after producing the “screen order” to put the odd-level nonspacing marks on the right of their base characters. As the levels of nonspacing marks will be the same as their base characters, this pass can swap the order of nonspacing mark glyphs and base character glyphs in right-left (odd) levels. (See Section 3.12, Bidirectional Behavior.)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 121 5.14 Rendering Nonspacing Marks Implementation Guidelines

Justification. Typically, full justification of text adds extra space at space characters so as to widen a line; however, if there are too few (or no) space characters, some systems add extra letterspacing between characters (see Figure 5-10). This process needs to be modified if zero-width nonspacing marks are present in the text. Otherwise, the nonspacing marks will be separate from their base characters. Figure 5-10. Justification Zürich 66 points/6 positions Zrichu¨ = 11 points per position 66 points/5 positions Z ü rich = 13.2 points per position

Because nonspacing marks always follow their base character, proper justification adds let- terspace between characters only if the second character is a base character.

Positioning Methods A number of different methods are available to position nonspacing marks so that they are in the correct location relative to the base character and previous nonspacing marks. Positioning with Ligatures. A fixed set of composed character sequences can be rendered effectively by means of fairly simple substitution (see Figure 5-11). Wherever the glyphs representing a sequence of occur, a glyph representing the combined form is substituted. Because the nonspacing mark has a zero advance width, the composed character sequence will automatically have the same width as the base char- acter. (More sophisticated text rendering systems may take further measures to account for those cases where the composed character sequence kerns differently or has a slightly dif- ferent advance width than the base character.) Figure 5-11. Positioning with Ligatures a + @¬ ➠ ä A + @¬ ➠ Ä (f + i ➠ fi)

Positioning with ligatures is perhaps the simplest method of supporting nonspacing marks. Whenever there is a small, fixed set, such as those corresponding to the precomposed char- acters of 8859-1 (Latin1), this method is straightforward to apply. Because the composed character sequence almost always has the same width as the base character, rendering, mea- surement, and editing of these characters are much easier than for the general case of liga- tures. If a composed character sequence does not form a ligature, then one of the two following methods can be applied. If they are not available, then a fallback method can be used.

122 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.14 Rendering Nonspacing Marks

Positioning with Contextual Forms. A more general method of dealing with positioning of nonspacing marks is to use contextual formation (see Figure 5-12). In this case, several different glyphs correspond to different positions of the accents. Base glyphs generally fall into a fairly small number of classes, based upon their general shape and width. According to the class of the base glyph, a particular glyph is chosen for a nonspacing mark. Figure 5-12. Positioning with Contextual Forms ·ü ➠ ·F Åü ➠ ÅZ Ðü = Ðü

In general cases, a number of different heights of glyphs can be chosen to allow stacking of glyphs, at least for a few deep (when these bounds are exceeded, then the fallback methods can be used). This method can be combined with the ligature method so that in specific cases ligatures can be used to produce fine variations in position and shape. Positioning with Enhanced Kerning. A third technique for positioning diacritics is an extension of the normal process of kerning to be both horizontal and vertical (see Figure 5-13). Typically, kerning maps from pairs of glyphs to a positioning offset. For example, in the word “To” the “o” should nest slightly under the “T”. An extension of this system maps to both a vertical and a horizontal offset, allowing glyphs to be arbitrarily posi- tioned. Figure 5-13. Positioning with Enhanced Kerning To w« To w«

For effective use in the general case, the kerning process must also be extended to handle more than simple kerning pairs, as multiple diacritics may occur after a base letter. Positioning with enhanced kerning can be combined with the ligature method so that in specific cases ligatures can be used to produce fine variations in position and shape.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 123 5.15 Locating Text Element Boundaries Implementation Guidelines

5.15 Locating Text Element Boundaries A string of Unicode-encoded text often needs to be broken up into text elements program- matically. Common examples of text elements include what users think of as characters, words, lines, and sentences. The precise determination of text elements may vary according to locale, even as to what constitutes a character. The goal of matching user perceptions cannot always be met because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E  ) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, or at least not surprise the user. Rather than concentrate on algorithmically searching for text elements themselves, a sim- pler computation looks instead at detecting the boundaries between those text elements. The determination of those boundaries is often critical to the performance of general soft- ware, so it is important to be able to make such a determination as quickly as possible. The following boundary determination mechanism provides a straightforward and effi- cient way to determine word boundaries. It builds upon the uniform character representa- tion of the Unicode Standard, while handling the large number of characters and special features such as combining marks and surrogates in an effective manner. As this boundary determination mechanism lends itself to a completely data-driven implementation, it can be customized to particular locales according to special language or user requirements without recoding. In particular, word, line, and sentence boundaries will need to be cus- tomized according to locale and user preference. In Korean, for example, lines may be bro- ken either at spaces (as in Latin text) or on ideograph boundaries (as in Chinese). For some languages, this simple method is not sufficient. For example, Thai linebreaking requires the use of dictionary lookup, analogous to English hyphenation. An implementa- tion therefore may need to provide means to override or subclass the standard, fast mecha- nism described in the “Boundary Specification” subsection in this section. The large character set of the Unicode Standard and its representational power place requirements on both the specification of text element boundaries and the underlying implementation. The specification needs to allow for the designation of large sets of char- acters sharing the same characteristics (for example, uppercase letters), while the imple- mentation must provide quick access and matches to those large sets. The mechanism also must handle special features of the Unicode Standard, such as com- bining or nonspacing marks, conjoining jamo, and surrogate characters. The following discussion looks at two aspects of text element boundaries: the specification and the underlying implementation. Specification means a way for programmers and localizers to specify programmatically where boundaries can occur.

Boundary Specification A boundary specification defines different classes, then lists the rules for boundaries in terms of those classes. Remember that characters may be represented by a sequence of two surrogates. The character classes are specified as a list, where each element of the list is • A literal character • A range of characters • A property of a Unicode character, as defined in the Unicode Character Data- base

124 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries

• A removal of the following elements from the list For the formal description of the notation used in this section, see Section 0.2, Notational Conventions. Additional notational conventions used in this section are as follows: ÷ Allow break here. × Do not allow break here. An (“_”) is used to indicate a space in examples. There is always a boundary at the beginning and at the end of a string of text. As with typi- cal regular expressions, the longest match possible is used. For example, in the following, the boundary is placed after as many Y’s as possible: X Y* ÷ ¬ X (In this book, the rules are numbered for reference.) Some additional constraints are reflected in the specification. These constraints make the implementation significantly simpler and more efficient and have not been found to be limitations for natural language use. 1. Limited context. Given boundaries at positions X and Y, then the positions of any other boundaries between X and Y do not depend on characters outside of X and Y (so long as X and Y remain boundary positions). For example, with boundaries at “ab÷cde”, changing “a” to “A” cannot intro- duce a new boundary, such as at “Ab÷cd÷e”. 2. Single boundaries. Each rule has exactly one boundary position. Because of rule (1), this restriction is more a limitation on the specification methods, because a rule with two boundaries could generally be expressed as two rules.

For example, “ab÷cd÷ef” could be broken into “ab÷cd” and “cd÷ef”.

3. No conflicts. Two rules cannot have initial portions that match the same text, but with different boundary positions. For example, “x÷abc” and “a÷bc” cannot be part of the same boundary specifi- cation. 4. No overlapping sets. For efficiency, two character sets in a specification cannot intersect. A later character set definition will override a previous one, removing its characters from the previous set.

For example, the second set specification removes “AEIOUaeiou” from the first:

Let = [Lu][Ll][Lt][Lm][Lo] EngVowel = AEIOUaeiou 5. No more than 256 sets. This restriction is purely an implementation detail to save on storage. 6. Ignore degenerates. Implementations need not make special provisions to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark. These rules can also be approximated with pair tables, where only one character before, and one character after, a possible break point are considered. Although this approach does not generally give the correct results, it is often sufficient if languages are restricted. How- ever, the grapheme boundaries are designed to be fully expressed using pair tables.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 125 5.15 Locating Text Element Boundaries Implementation Guidelines

Example Specifications Different issues are present with different types of boundaries, as the following discussion and examples should make clear. In this section, the rules are somewhat simplified, and not all edge cases are included. In particular, characters such as control characters and format characters do not cause breaks. Permitting such cases would complicate each of the exam- ples (and is left as an exercise for the reader). In addition, it is intended that the rules them- selves be localizable; the examples provided here are not valid for all locales. Rather than listing the contents of the character sets in these examples, the contents are simply explained. If an implementation uses a state table, the performance does not depend on the complex- ity or number of rules. The only feature that does affect performance is the number of characters that may match after the boundary position in a rule that is matched.

Grapheme Boundaries A grapheme is a minimal unit of a writing system, just as a is a minimal unit of a sound system. In some instances, a grapheme is represented by more than one Unicode character, such as in Indic languages. As far as a user is concerned, the underlying represen- tation of text is not important, but it is paramount that an editing interface present a uni- form implementation of what the user thinks of as graphemes. Graphemes should behave as units in terms of mouse selection, arrow key movement, backspacing, and so on. For example, if an accented character is represented by a combining character sequence, then using the right arrow key should skip from the start of the base character to the end of the last combining character. This situation is analogous to a system using conjoining jamo to represent Hangul syllables, where they are treated as single graphemes for editing. In those circumstances where end users need character counts (which is actually rather rare), the counts need to correspond to the users’ perception of what constitutes a grapheme. The principal requirements for general character boundaries are the handling of combin- ing marks, Hangul conjoining jamo, and Indic and Tibetan character clusters. See Table 5-3. Table 5-3. Grapheme Boundaries

Character Classes

CR Carriage Return LF Line Feed Format All other control or format characters Virama Indic viramas Joining All combining characters, plus Tibetan subjoined characters L Hangul leading jamo V Hangul vowel jamo T Hangul trailing jamo Lo Other letters Other All other characters

126 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries

Table 5-3. Grapheme Boundaries (Continued)

Rules Always break before an LF, unless it is preceded by a CR. Always break before and after all other format or control characters. ¬ CR ÷ LF (1) CR ÷¬ LF (2) ÷( CR | Format ) (3) ( Format | LF )÷ (4)

Break before viramas and joining characters only if they are preceded by con- trols or formats. ( Format | CR | LF ) ÷ ( Virama | Joining ) (5)

Break before and after Hangul jamo unless they are in an allowable sequence. ¬ L ÷ L (6) ¬ ( L | V )÷V (7) ¬ ( L | V | T )÷T (8) ( L | V | T )÷¬ ( L | V | T )(9)

Don’t break between viramas and other letters. This rule provides for Indic graphemes, where virama will link character clusters together. Virama × Lo (10)

Break before any other characters. ÷ Other (11)

Word Boundaries Word boundaries are used in a number of different contexts. The most familiar ones are double-click mouse selection, “move to next word,” and detection of whole words for search and replace. For the search and replace option of “find whole word,” the rules are fairly clear. The boundaries are between letters and nonletters. Trailing spaces cannot be counted as part of a word, because searching for “abc” would then fail on “abc_”. For word selection, the rules are somewhat less clear. Some programs include trailing spaces, whereas others include neighboring punctuation. Where words do not include trailing spaces, sometimes programs treat the individual spaces as separate words; other times they treat an entire string of spaces as a single word. (The latter fits better with usage in search and replace.) • Word boundaries can also be used in so-called intelligent cut and paste. With this feature, if the user cuts a piece of text on word boundaries, adjacent spaces are collapsed to a single space. For example, cutting “quick” from “The_quick_fox” would leave “The_ _fox”. Intelligent cut and paste collapses this text to “The_fox”.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 127 5.15 Locating Text Element Boundaries Implementation Guidelines

This discussion outlines the case where boundaries occur between letters and nonletters, and there are no boundaries between nonletters. It also includes Japanese words (for word selection). In Japanese, words are not delimited by spaces. Instead, a heuristic rule is used in which strings of either kanji (ideographs) or katakana characters (optionally followed by strings of hiragana characters) are considered words. See Table 5-4. Table 5-4. Word Boundaries

Character Classes

CR Carriage Return LF Line Feed Sep LS, PS TAB Tab character Let Letter Com Combining mark Hira Hiragana Kata Katakana Han Han ideograph (Kanji)

Rules Always break before an LF, unless it is preceded by a CR. Always break before and after all other format or control characters, including TAB. ¬ CR ÷ LF (1) CR ÷¬ LF (2) ÷ ( CR | Sep | TAB )(3) ( Sep | TAB | LF ) ÷ (4)

Break between letters and nonletters. Include trailing nonspacing marks as part of a letter. ¬ ( Let | Com ) ÷ Let (5) Let Com* ÷¬ Let (6)

Handle Japanese specially for word selection. Treat clusters of kanji or kata- kana (with or without following hiragana) as single words. Break when pre- ceded by other characters (such as punctuation). Include nonspacing marks. ¬ ( Hira | Kata | Han | Com ) ÷ Hira | Kata | Han (7) Hira Com* ÷¬ Hira (8) Kata Com* ÷¬ ( Hira | Kata ) (9) Han Com* ÷¬ ( Hira | Han ) (10)

• One could also generally break between any letters of different scripts. In prac- tice—except for languages that do not use spaces—this possibility is a degener- ate case.

128 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries

Line Boundaries Line boundaries determine acceptable locations for line-wrap to occur without hyphen- ation. (More sophisticated line-wrap also makes use of hyphenation, but generally only in cases where the natural line-wrap yields inadequate results.) Note that this approach is very similar to word boundaries but not generally identical. For the purposes of line-break, a composed character sequence should generally be treated as though it had the same properties as the base character. The nonspacing marks should not be separated from the base character. Nonspacing marks may be exhibited in isolation—that is, over a space or nonbreaking space. In that case, the entire composed character sequence is treated as a unit. If the com- posed character sequence consists of a no-break space followed by nonspacing marks, then it does not generally allow line-breaks before or after the sequence. If the composed charac- ter sequence consists of any other space followed by nonspacing marks, then it generally does allow line-breaks before or after the sequence. There is a degenerate case where a nonspacing mark occurs as the first character in the text or after a line or paragraph separator. In that case, the most consistent treatment for the line-break is to treat the nonspacing mark as though it were applied to a space. See Table 5-5. Table 5-5. Line Boundaries

Character Classes

CR Carriage Return LF Line Feed ZWSP Zero-width space ZWNBSP Zero-width no-break space Sp Spaces Break Mandatory break (for example, Paragraph Separator) Com All combining characters (including Tibetan subjoined characters), plus medial and final conjoining Hangul jamo Ideographic Ideographic characters Alphabetic Alphabetic characters and most symbols Exclam Terminating characters like exclamation point Syntax Solidus (‘/’) Open Open punctuation, such as ‘(’ Close Closing , such as ‘)’ Quote Ambiguous quotations marks, such as " and ' NonStarter Small hiragana and katakana characters HyphenMinus U+002D Insep Ellipsis characters and leaders Number Digits NumericPrefix Characters such as ‘$’ NumericPostfix Characters such as ‘%’ NumericPrefix Characters such as ‘$’ NumericPostfix Characters such as ‘%’ NumericInfix Characters such as period and comma that occur with- in European numbers

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 129 5.15 Locating Text Element Boundaries Implementation Guidelines

Table 5-5. Line Boundaries (Continued)

Base Base characters NonBase Nonbase characters (control characters and format characters) All All Unicode characters Note: For a precise specification of these classes, see Unicode Technical Report #14, “Line Breaking Properties,” on the CD-ROM or the up-to-date version on the Unicode Web site. (The classes have slightly different names here for consistency.) Rules Explicit breaks and nonbreaks: • Always break after hard line-breaks (but never between CR and LF). There is a break opportunity after every ZWSP, but not a hard break. • Do not break before spaces or hard line-breaks. • Do not break before or after ZWNBSP. CR × LF (1) ( CR | LF | Break | ZWSP ) ÷ (2) × ( SP | CR | LF | Break | ZWSP | ZWNBSP )(3) ZWNBSP × (4) Combining marks: • Do not break graphemes (before combining marks). Virama and jamos are merged with the proper classes so they work correctly. In all of the following rules: • If a space is the base character for a combining mark, the space is changed to type AL. • At any possible break opportunity between Com and a following character, Com behaves as if it had the type of its base character. If there is no base, the Com behaves like AL. × Com (5) Sp Com ò Alphabetic Com (6) Base Com ò Base Base (7) NonBase Com ò NonBase Alphabetic (8) Handle opening and closing: These have special behavior with respect to spaces. • Do not break before exclamations, syntax characters, or closing punctuation, even after spaces.

130 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries

Table 5-5. Line Boundaries (Continued)

• Do not break after opening punctuations, even after spaces. • Do not break within quotations and opening punctuations, or between closing punctuation and ideographs, even with inter- vening spaces. × ( Exclam | Syntax | Close ) (9) Open × (10) Quote × Open (11) Close × Ideographic (12) Once opening and closing are handled, spaces can be finished up. • Break after spaces Sp ÷ (13) Special case rules: • Do not break before or after ambiguous quotation marks. • Do not break before no-starts or HyphenMinus. • Do not break between two ellipses, or between letters or numbers and ellipsis, such as in ‘9...’, ‘a...’, ‘H...’. • Do not break within alphabetics and numbers, or numbers and alphabetics, or ideographs and numeric suffixes. × ( Quote | NonStarter | HyphenMinus ) (14) Quote × (15) ( Insep | Number | Ideographic | Alphabetic ) × Insep (16) Alphabetic × Number (17) Number × Alphabetic (18) Ideographic × NumericPostfix (19) Numbers are of various forms that should not be broken across lines—for example, $(12.35), 12, (12)¢, 12.54¢, and so on. These are approximated with the following rule. (Some cases were already handled, such as ‘9,’ and ‘[9’.) NumericPrefix × ( Number | Open | HyphenMinus ) (20) ( HyphenMinus | Syntax | Number | NumericInfix ) × Number (21) Number × NumericPostfix (22) Close × NumericPostfix (23)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 131 5.15 Locating Text Element Boundaries Implementation Guidelines

Table 5-5. Line Boundaries (Continued)

Join alphabetic letters and break everywhere else. Alphabetic × Alphabetic (24) ÷ All (25) All ÷ (26)

Sentence Boundaries Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. Plain text provides inadequate information for determining good sentence boundaries. Periods, for example, can either signal the end of a sentence, indicate abbreviations, or be used for decimal points. Without analyzing the text semantically, it is impossible to be cer- tain which of these usages is intended (and sometimes ambiguities still remain). See Table 5-6. Table 5-6. Sentence Boundaries

Character Classes

CR Carriage Return LF Line Feed Sep LS, PS Sp Space separator Term !? Dot Period Cap Uppercase, titlecase, and noncased letters Lower Lowercase Open Open punctuation Close Close punctuation, period, comma, ...

Rules Always break before an LF, unless it is preceded by a CR. Always break before and after separating control or format characters. ¬ CR ÷ LF (1) CR ÷¬ LF (2) ÷ ( CR | Sep ) (3) ( Sep | LF ) ÷ (4)

Break after sentence terminators, but include nonspacing marks, closing punctuation, trailing spaces, and (optionally) a paragraph separator. Term Close* Sp* {Sep} ÷ (5)

132 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.16 Identifiers

Table 5-6. Sentence Boundaries (Continued)

Handle a period specially, as it may be an abbreviation or numeric period— and not the end of a sentence. Don’t break if it is followed by a lowercase let- ter instead of an uppercase letter. Dot Close* Sp+ ÷ Open* ¬ Lower (6)

Random Access A further complication is introduced by random access (see Figure 5-14). When iterating through a string from beginning to end, the preceding approach works well. It guarantees a limited context, and it allows a fresh start at each boundary to find the next boundary. By constructing a state table for the reverse direction from the same specification of the rules, reverse searches are possible. However, suppose that the user wants to iterate starting at a random point in the text. If the starting point does not provide enough context to allow the correct set of rules to be applied, then one could fail to find a valid boundary point. For example, suppose a user clicked after the first space in “?_ _A”. On a forward iteration searching for a sentence boundary, one would fail to find the boundary before the “A”, because the “?” hadn’t been seen yet. A second set of rules to determine a “safe” starting point provides a solution. Iterate back- ward with this second set of rules until a safe starting point is located, then iterate forward from there. Iterate forward to find boundaries that were located between the starting point and the safe point; discard these. The desired boundary is the first one that is not less than the starting point. Figure 5-14. Random Access

? _ _ A ...

This process would represent a significant performance cost if it had to be performed on every search. However, this functionality could be wrapped up in an iterator object, which preserves the information regarding whether it currently is at a valid boundary point. Only if it is reset to an arbitrary location in the text is this extra backup processing performed.

5.16 Identifiers A common task facing an implementer of the Unicode Standard is the provision of a pars- ing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in Unicode character-based parsers, a set of guidelines is provided for the definition of identi- fier syntax. Note that the task of parsing for identifiers and the task of locating word boundaries are related, and it is a straightforward procedure to restate the sample syntax provided here in

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 133 5.16 Identifiers Implementation Guidelines the form just discussed for locating text element boundaries. In this section, a more tradi- tional BNF-style syntax is presented to facilitate into existing standards. The formal syntax provided here is intended to capture the general intent that an identifier consists of a string of characters that begins with a letter or an ideograph, and then includes any number of letters, ideographs, digits, or . Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters from the ASCII range ($, @, #, _) in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, imple- menters need only combine these specific rules with the sample syntax provided here. These rules are no more complex than current rules in the common programming languages, except that they include more characters of different types. The innovations in the sample identifier syntax to cover the Unicode Standard correctly include the following: 1. Incorporation of proper handling of combining marks 2. Allowance for layout and format control characters, which should be ignored when parsing identifiers Combining Marks. Combining marks must be accounted for in identifier syntax. A com- posed character sequence consisting of a base character followed by any number of com- bining marks must be valid for an identifier. This requirement results from the conformance rules in Chapter 3, Conformance, regarding interpretation of canonical- equivalent character sequences. Enclosing combining marks (for example, U+20DD..U+20E0) are excluded from the syn- tactic definition of because the composite characters that result from their composition with letters (for example, U+24B6     ) are themselves not valid constituents of identifiers. Layout and Format Control Characters. The Unicode characters that are used to control joining behavior, bidirectional ordering control, and alternative formats for display are explicitly defined as not affecting breaking behavior. Unlike space characters or other delimiters, they do not serve to indicate word, line, or other unit boundaries. Accordingly, they are explicitly included for the purposes of identifier definition. Some implementations may choose to filter out these ignorable characters; this approach offers the advantage that two identifiers that appear to be identical will more likely be identical. Specific Character Additions. Specific identifier syntaxes can be treated as slight modifica- tions of the generic syntax based on character properties. Thus, for example, SQL identifi- ers allow an underscore as an identifier part (but not as an identifier start); C identifiers allow an underscore as either an identifier part or an identifier start. For the notation used in this section, see Section 0.2, Notational Conventions.

Syntactic Rule ::= ( | )* Identifiers are defined by a set of character categories from the Unicode Character Data- base. See Table 5-7. Implementations that require a stable definition of identifiers for conformance must refer- ence a specific version of the Unicode Standard, such as Version 3.0.0, or explicitly list the particular subset of Unicode characters that fall under their definition of syntactic class.

134 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.17 Sorting and Searching

Table 5-7. Syntactic Classes for Identifiers

Syntactic Class Equivalent Category Set Coverage Lu, Ll, Lt, Lm, Lo, Nl Uppercase letter, lowercase let- ter, titlecase letter, modifier let- ter, other letter, letter number Mn, Mc, Nd, Pc, Cf Nonspacing mark, spacing com- bining mark, decimal number, connector punctuation, for- matting code

5.17 Sorting and Searching Sorting and searching overlap in that both implement degrees of equivalence of terms to be compared. In the case of searching, equivalence defines when terms match (for example, it determines when case distinctions are meaningful). In the case of sorting, equivalence affects the proximity of terms in a sorted list. These determinations of equivalence always depend on the application and language, but for an implementation supporting the Uni- code Standard, sorting and searching must also take into account the Unicode character equivalence and canonical ordering defined in Chapter 3, Conformance. This section also discusses issues of adapting sublinear text searching algorithms, providing for fast text searching while still maintaining language-sensitivity, and using the same ordering algorithms that are used for collation. For more information, see Unicode Techni- cal Report #10, “Unicode Collation Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site.

Culturally Expected Sorting Sort orders vary from culture to culture, and many specific applications require variations. Sort order can be by word or sentence, case sensitive or insensitive, ignoring accents or not; it can also be either phonetic or based on the appearance of the character, such as ordering by stroke and radical for East Asian ideographs. Phonetic sorting of Han characters requires use of either a lookup dictionary of words or special programs to maintain an associated phonetic spelling for the words in the text. Languages vary not only regarding which types of sorts to use (and in which order they are to be applied), but also in what constitutes a fundamental element for sorting. For exam- ple, Swedish treats U+00C4       as an individual let- ter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of ä following a. Spanish traditionally sorted the digraph ll as if it were a let- ter between l and m. Examples from other languages (and scripts) abound. As a result, it is neither possible to arrange characters in an encoding in an order so that simple binary string comparison produces the desired collation order, nor is it possible to provide single-level sort-weight tables. The latter implies that character encoding details have only an indirect influence on culturally expected sorting. To address the complexities of culturally expected sorting, a multilevel comparison algo- rithm is typically employed.1 Each character in string is given several categories of sort weights. Categories can include alphabetic, case, and diacritic weights, among others.

1. A good example can be found in Denis Garneau, Keys to Sort and Search for Culturally- Expected Results (IBM document number GG24-3516, June 1, 1990), which addresses the problem for Western European languages, Arabic, and Hebrew.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 135 5.17 Sorting and Searching Implementation Guidelines

In a first pass, these weights are accumulated into a sort key for the string. At the end of the first pass, the sort key contains a string of alphabetic weights, followed by a string of case weights, and so on. In a second pass, these substrings are compared by order of importance so that case and accent differences can either be fully ignored or applied only where needed to differentiate otherwise identical sort strings. The first pass looks very similar to the decomposition of Unicode characters into base char- acter and accent. The fact that the Unicode Standard permits multiple spellings (composed and composite) of the same accented letter turns out not to matter at all. If anything, a completely decomposed text stream can simplify the first implementation of sorting. To provide a powerful, table-based approach to natural-language collation using Unicode characters, implementers need to consider providing full functionality for these features of language-sensitive algorithmic sorting: • Four collation levels • French or normal orientation • Contracting or expanding characters • Ordering of unmapped characters • More than one level of ignorable characters

Unicode Character Equivalence Section 3.6, Decomposition, and Section 3.10, Canonical Ordering Behavior, define equiva- lent sequences and provide an exact algorithm for determining when two sequences are equivalent. Equivalent sequences of Unicode characters should be collated in exactly the same way, no matter what the underlying storage is. Figure 5-15 gives two examples of character equivalence. Figure 5-15. Character Equivalence ä = a @¨ a @¨ @. = a @. @¨ a @¨ @` ≠ a @` @¨

Compatibility characters—especially where they have the same appearance—should also be collated in exactly the same way (for example, U+00C5 Å        and U+212B Å  ).

Similar Characters Languages differ in what they consider similar characters, but users generally want charac- ters that are similar (such as upper- and lowercase) to sort close to each other but not to be collated exactly the same way. If upper- and lowercase collated identically, words differing

136 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.17 Sorting and Searching only in case would appear in random order (see Figure 5-16). The same is true for accented characters. Typically, there is an ordering among these similar characters. In English dictionaries, for example, lowercase precedes uppercase (the reverse of what happens with a naïve ASCII comparison). However, this ordering applies only when the strings are the same in all other respects. Otherwise, Aachen would sort after azure. Characters with accents often sort close to the base character, but different accents on the same base character always sort in a given order. Figure 5-16. Naïve Comparison

pat pat É pet Pat É PAT pit É É pet É pot pit É put É É pot Pat É É put PAT É Naïve Comparison Priority Level Comparison

Levels of Comparison The way to handle these problems is to use multiple levels of comparison and attach only a secondary or tertiary difference to the letters based on their case or accents (see Figure 5-17). Thus we get the following rules: R1 Count secondary differences—only if there are no primary differences. R2 Count tertiary differences—only if there are no primary or secondary differences. Figure 5-17. Levels of Comparison resume resume résumé Resume resumes résumé résumés Résumé Secondary Difference Tertiary Difference

In English and similar languages, accents make only a secondary difference, and case differ- ences make only a tertiary difference. Ignorable characters are counted as secondary or ter- tiary differences. In other languages and scripts, other features map to secondary or tertiary differences.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 137 5.17 Sorting and Searching Implementation Guidelines

French presents an interesting special case. In French sorting, the differences in accents later in the string are more important than those earlier in the string. In Figure 5-18, the two pairs of words are identical in the first five base characters. In English, the first accents are the most significant ones; in French, the first accents from the end are as the boxes show. Figure 5-18. Orientation

Primary Sequence Primary Sequence Secondary Sequence Sequence Secondary péché pêche pêche péché pécher pécher pêcher pêcher Forward Sequence Backward Sequence (French)

Ignorable Characters Another class of interesting cases involves ignorable characters (see Figure 5-19), such as spaces, , and some other punctuation. In this case, the character itself is ignored unless there are no stronger differences in the string. Figure 5-19. Ignorable Characters blackbird role black-bird roˆ@le blackbirds roles black-birds roˆ@les Ignorable punctuation Ignorable accents

The general rule is as follows: R3 Treat ignorable characters as having no primary difference.

Multiple Mappings With many language collations such as traditional Spanish or German, one character may compare as if it were two, or two characters may compare as if they were one character (see Figure 5-20). In traditional Spanish orthography, for example, “Ch” sorts as a single let- ter—that is, after “Cz”; otherwise, it would come before “Ci”. In traditional German, “ö” sorts as if it were “oe”, putting it after “od” and before “of”.

138 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.17 Sorting and Searching

Figure 5-20. Multiple Mappings Cza r Tod QS Chico Tšn e QU Hermos a Tofu Za pa t a Ton RS Grouped (n:1) Split (1:n) RT General (n:m)

In Japanese, a á length mark lengthens the vowel of the preceding character; depending on the vowel, the result will sort in a different order. For example, after the character  “ka”, the á length mark indicates a long “a” and comes after ‡ “a”; after the character ’ “”, the á length mark indicates a long “i” and comes after ‰ “i”. Look at the second character in each word in the third column of Figure 5-20. There are three different characters in the second position: ‡ “a”, á length mark, and ‰ “i”. The gen- eral rule is as follows: R4 N characters may compare as if they were M.

Collating Out-of-Scope Characters Collation implements an ordering that matches the expectations of the user, based on rules of the user’s language. Lists of terms encoded using the Unicode Standard may easily come from many different languages. These terms are all sorted according to the custom of the user’s language. For scripts and characters outside the use of a particular language, explicit rules may not exist. For example, Swedish and French have clear and different rules on sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines a particular sorting order for the Han ideographs. Implementations supporting the Uni- code Standard therefore typically provide a default ordering (like the culturally neutral ordering for ideographs used in this standard). Sorting for a Japanese user would still sort upper- and lowercase Latin letters in proximity. The relative ordering of scripts is typically configurable. R5 Default to a common or culturally neutral ordering for out-of-scope characters.

Unmapped Characters Another option is to treat out-of-scope characters as irrelevant. Such characters can include box forms, dingbats, and perhaps also alphabets that are not of concern for the user base of an implementation. Characters irrelevant to a collation sequence are usually not assigned weights; this choice saves space in the collation sequence. However, to provide a definitive sorting order, a position needs to be specified in the collation sequence for any unassigned character. For efficiency, collate any unassigned characters in Unicode bit order. R6 Collate irrelevant characters in Unicode bit order, in a specified position.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 139 5.17 Sorting and Searching Implementation Guidelines

Parameterization For effective use of collation, programmers need to have a certain level of control and need to be able to get back sufficient information. One output needs to be the place in the string where a difference occurs, so that common initial substrings can be found. Notice that this position may be different in the two different strings. Because of multiple mappings, the first three characters in one string might be equivalent to the first four in the other. Input • Specify language rule • Relative order of scripts • Allows specification of precision Output • First point in each string where different • Direction and precision of difference Processing • Can preprocess each string for fast comparison • Process just as much as needed for a stand-alone comparison

Optimizations Multiple-level comparison requires a bit more work than binary comparison. Although real-life studies put the overhead at around 50 percent, it often pays to first transform terms to be sorted into equivalent sort keys, which result in the same sorted list when subjected to a simple and fast binary comparison. (In the standard C library, the function wcsxfrm provides such a transformation.) These sort keys might consist of a string of base weights followed by strings for weights used for secondary and tertiary differences, as just dis- cussed. Unlike Unicode code values, sort keys don’t need to be 16-bit-based. Thus highly optimized functions, such as the strcmp function from the standard C library, can be used. Sort keys can also be stored, obviating recomputation when a list needs to be -sorted. Another straightforward optimization is to compare as you go. For each string, sort weights are assembled into sort keys only until a difference is located. This approach reduces the computation necessary when a difference is found early in the string.

Searching Searching is subject to many of the same issues as comparison, such as a choice of a weak, strong, or exact match. Other features are often added, such as only matching words (that is, where a word boundary appears on each side of the match). One technique is to code a fast search for a weak match. When a candidate is found, additional tests can be made for other criteria (such as matching diacriticals, word match, case match, and so on). When searching strings, it is necessary to check for trailing nonspacing marks in the target string that may affect the interpretation of the last matching character. That is, a search for “ Jose” may find a match in the string “Visiting San José, Costa Rica is a...”. If an exact (diacritic) match is desired, then this match should be rejected. If a weak match is sought, then the match should be accepted, but any trailing nonspacing marks should be included when returning the location and length of the target substring. The mechanisms discussed in Section 5.15, Locating Text Element Boundaries, can be used for this purpose.

140 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.18 Case Mappings

One important application of weak equivalence is case-insensitive searching. Many tradi- tional implementations map both the search string and the target text to uppercase. How- ever, case mappings are language-dependent and not unambiguous (see Section 4.1, Case— Normative, and Section 7.1, Latin). The preferred method of implementing case insensitiv- ity uses the same mechanisms and tables described in the sorting discussions in the begin- ning of this section. In particular, it is advisable for many applications (for example, file systems) to treat a particular set of characters (i, I, –dotless-i,•capital i with dot) as a single equivalency class to guarantee reasonable results for Turkish. A related issue can arise because of inaccurate mappings from external character sets. To deal with this problem, characters that are easily confused by users can be kept in a weak equivalency class (ñ d-bar, Õ eth, ƒ capital d-bar, µ capital eth). This approach tends to do a better job of meeting users’ expectations when searching for named files or other objects.

Sublinear Searching International searching is clearly possible using the information in the collation, just by using brute force. However, this tactic requires an O(m*n) algorithm in the worst case and O(m) in common cases, where n is the number of characters in the pattern that is being searched for and m is the number of characters in the target to be searched. A number of algorithms allow for fast searching of simple text, using sublinear algorithms. These algorithms use only O(m/n) in common cases, by skipping over characters in the tar- get. Several implementers have adapted one of these algorithms to search text pre- transformed according to a collation algorithm, which allows for fast searching with native-language matching (see Figure 5-21). Figure 5-21. Sublinear Searching The_quick_brown… quick qui ck qu i ck quick quick

The main problems with adapting a language-aware collation algorithm for sublinear searching relate to multiple mappings and ignorables. Additionally, sublinear algorithms precompute tables of information. Mechanisms like the two-stage tables introduced in Figure 5-1 are efficient tools in reducing memory requirements.

5.18 Case Mappings The vast majority of case mappings are uniform across languages. In a few instances, upper- and lowercase mappings may differ from language to language between writing sys- tems that employ the same letters. The principal example is Turkish, where U+0131 “–”      maps to U+0049 “I”     and U+0069 “i”     maps to U+0130 “•”       , as shown in Figure 5-22.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 141 5.18 Case Mappings Implementation Guidelines

Figure 5-22. Case Mapping for Turkish I

– õ I i õ •

The process of case mapping has important exceptions. See the file SpecialCasing.html on the CD-ROM for more about these exceptions. For more information on case mapping data, see Section 4.1, Case—Normative, and Unicode Technical Report #21, “Case Map- pings,” on the CD-ROM or the up-to-date version on the Unicode Web site. Case Mappings Not Reversible. It is important to note that casing operations do not always provide a round-trip mapping. Also, because many characters are really caseless (most of the IPA block, for example), uppercasing a string does not mean that it will no longer contain any lowercase letters. Case correspondences are not always one-to-one: the result of case folding may be a differ- ent character length than in the source string. For example, U+00DF ß   -    becomes “SS” in uppercase. As discussed in Section 7.2, Greek, the iota-subscript characters used to represent ancient text can be viewed as having special case mappings. Normally, the uppercase and lowercase forms of alpha-iota-subscript will map back and forth. In some instances, where uppercase words should be transformed into their older spellings by removing accents and changing the iota-subscript into a capital iota (and perhaps even removing spaces). Note that case transformations are not reversible. For example, upper(lower(“John Brown”)) Å “JOHN BROWN” lower(upper(“John Brown”)) Å “john brown” There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper-, lower-, nor titlecase. This format is sometimes called inner-caps. Also, some single characters do not have reversible mappings. For example, U+03C2 §      uppercases to U+03A3 ˆ    , but the capital lowercases to (nonfinal) U+03C3 ¨    . For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string, and then return to it in the sequence of keys. The user interface would produce the following results in response to a series of command keys. Notice that the original string is restored every fourth time. 1. The quick brown 2. THE QUICK BROWN 3. the quick brown 4. The Quick Brown 5. The quick brown Uppercase, titlecase, and lowercase can be represented in a word processor by using a char- acter style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.

142 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.18 Case Mappings

For information on case conversion, detecting when strings are of a given case, and per- forming caseless matching of strings, see Unicode Technical Report #21, “Case Mappings,” on the CD-ROM or the up-to-date version on the Unicode Web site.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 143 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 6 Punctuation 6

The appearance and usage of punctuation marks varies between languages and scripts, but punctuation marks have a common function. They separate units of text, such as sentences and phrases, thus clarifying the meaning of the text. The use of punctuation marks is not limited to prose; they are also used in mathematical and scientific formulae, for example. In the Unicode Standard, punctuation marks are called punctuation characters. In the standard, characters are organized into related groups called blocks. Character blocks generally contain characters from a single script. In many cases, a script is fully rep- resented in its character block. There are, however, important exceptions, most notably in the area of punctuation characters. Punctuation characters occur in several widely sepa- rated places in the character blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, and CJK Symbols and Punctuation, as well as occasional characters in charac- ter blocks for specific scripts. Punctuation characters—for example, U+002C  or U+2022 —are encoded only once, rather than being encoded again and again for particular scripts; such general- purpose punctuation may be used for any script or mixture of scripts. In contrast, punctu- ation characters that are encoded in a given script block—for example, U+058A   or U+060C  —are intended primarily for use in the context of that script. They are unique in function, have different directionality, or are distinct in appear- ance or usage from their generic counterparts. The use and interpretation of punctuation characters can be heavily context-dependent. For example, U+002E   can be used as sentence-ending punctuation, an abbrevi- ation indicator, a decimal point, and so on. Punctuation characters vary in appearance with the font style, just like the surrounding text characters. In some cases, where used in the context of a particular script, a specific glyph style is preferred. For example, U+002E   should appear square when used with Armenian, but is typically circular when used with Latin. For mixed Latin/Armenian text, two fonts (or one font allowing for context-dependent glyph variation) may need to be used to faithfully render the character. In a bidirectional context (see Section 3.12, Bidirectional Behavior), shared punctuation characters have no inherent directionality, but resolve according to the Unicode bidirec- tional algorithm. Where the image of a punctuation character is not bilaterally symmetric, the mirror image is used when the character is part of the right-to-left text stream (see Section 4.7, Mirrored—Normative). In vertical writing, many punctuation characters have special vertical glyphs. A number of characters in the blocks described in this chapter are not graphic punctuation characters, but nevertheless affect the operation of layout algorithms. For a description of these characters, see Section 13.2, Layout Controls.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 147 6.1 General Punctuation Punctuation

6.1 General Punctuation

Punctuation: U+0020–U+00BF Standards. The Unicode Standard adapts the ASCII (ISO 646) 7-bit standard by retaining the semantics and numeric code values, merely supplying enough leading zeros to convert them into 16-bit values. The content and arrangement of the ASCII standard are far from optimal in the context of a 16-bit space, but the Unicode Standard retains it without change because of its prevalence in existing usage. The ASCII (ANSI X3.4) standard is identical to ISO/IEC 646:1991-IRV. ASCII Graphic Characters. Some of the nonletter characters in this range suffer from overburdened usage as a result of the limited number of codes in a 7-bit space. Some cod- ing consequences of this problem are discussed in this section under “Encoding Characters with Multiple Semantic Values” and “General Punctuation,” but also see “Language-Based Use of Quotation Marks.” The rather haphazard ASCII collection of punctuation and mathematical signs are isolated from the larger body of Unicode punctuation, signs, and symbols (which are encoded in ranges starting at U+2000) only because the relative loca- tions within ASCII are so widely used in standards and software. Typographic Variations. Code values in the ASCII range are well established and used in widely varying implementations. The Unicode Standard therefore provides only minimal specifications on the typographic appearance of corresponding glyphs. For example, the value U+0024 ($) (derived from ASCII 24) has the semantic , leaving open the question of whether the dollar sign is to be rendered with one vertical stroke or two. The Unicode value U+0024 refers to the dollar sign semantic, not to its precise appearance. Likewise, in old-style numerals, where numbers vary in placement above and below the baseline, a decimal or thousands separator may be displayed with a dot that is raised above the baseline. Because it would be inadvisable to have a stylistic variation between old-style and new-style numerals that actually changes the underlying representation of text, the Unicode Standard considers this raised dot to be merely a glyphic variant of U+002E “.”  . For other characters in this range that have alternative glyphs, the Unicode character is displayed with the basic or most common glyph; rendering software may present any other graphical form of that character. Encoding Characters with Multiple Semantic Values. Some ASCII characters have multi- ple uses, either through ambiguity in the original standards or through accumulated rein- terpretations of a limited code set. For example, 27 hex is defined in ANSI X3.4 as (closing single ; acute accent), and 2D hex is defined as hyphen- minus. In general, the Unicode Standard provides the same interpretation for the equiva- lent code values, without adding to or subtracting from their semantics. The Unicode Stan- dard supplies unambiguous codes elsewhere for the most useful particular interpretations of these ASCII values; the corresponding unambiguous characters are cross-referenced in the character names list for this block. For a complete list of space characters and dash characters in the Unicode Standard, see “General Punctuation” later in this section. For historical reasons, U+0027 is a particularly overloaded character. In ASCII, it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apos- trophe modifier or acute accent). Punctuation marks generally break words; modifier let- ters generally are considered part of a word.

148 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation

The preferred character for apostrophe is U+2019, but U+0027 is commonly present on keyboards. In modern software, it is therefore common to substitute U+0027 by the appro- priate character in input. In these systems, a U+0027 in the data stream is always repre- sented as a straight vertical line and can never represent a curly apostrophe or a right quotation mark. For more information, see “” later in this section. Semantics of Paired Punctuation. Paired punctuation marks such as parentheses (U+0028, U+0029), square brackets (U+005B, U+005D), and braces (U+007B, U+007D) are interpreted semantically rather than graphically in the context of bidirectional or verti- cal texts; that is, these characters have consistent semantics but alternative glyphs depending upon the directional flow rendered by a given software program. The software must ensure that the rendered glyph is the correct one. When interpreted semantically rather than graphically, characters containing the qualifier “” are taken to denote opening; charac- ters containing the qualifier “” are taken to denote closing. For example, U+0028   and U+0029   are interpreted as opening and closing parentheses, respectively, in the context of bidirectional or vertical texts. In a right-to-left directional flow, U+0028 is rendered as “)”. In a left-to-right flow, the same character is rendered as “(”. See also “Language-Based Use of Quotation Marks” later in this section. Tilde. U+007E  can be used either as a Spacing Clone of Combining Tilde (see “Spacing Clones of Diacritics” in Section 7.1, Latin) or more often as a center line tilde, similar in appearance to U+223C “~”  . Two common uses are to indicate an approximate value or in dictionaries to repeat the defined term in the definition of the ~. Although U+007E “~”  is ambiguous in its rendering, modern fonts generally render it with a center line glyph, as shown in the code charts.

General Punctuation: U+2000–U+206F General punctuation combines punctuation characters and characterlike elements used to achieve certain text layout effects. Some punctuation characters can be used with many dif- ferent scripts. Many general punctuation characters can also be found in the Basic Latin (ASCII) and Latin-1 Supplement blocks. In many cases, current standards include generic characters for punctuation instead of the more precisely specified characters used in printing. Examples include the single and dou- ble quotes, period, dash, and space. The Unicode Standard includes these generic charac- ters, but also encodes the unambiguous characters independently: various forms of quotation mark, decimal period, em dash, en dash, minus, hyphen, em space, en space, hair space, zero-width space, and so on. Punctuation principally used with a specific script is found in the block corresponding to that script, such as U+061B “ý”   or the punctuation used with ideo- graphs in the CJK Symbols and Punctuation Block.

Space Characters The most commonly used space character is U+0020 . Also often used is its non- breaking counterpart, U+00A0 - . These two characters have the same width, but behave differently for line breaking. U+00A0 -  behaves like a numeric separator for the purposes of bidirectional layout. (See Section 3.12, Bidirectional Behavior, for a detailed discussion of the Unicode bidirectional algorithm.) In ideographic text, U+3000   is commonly used because its width matches that of the ideographs.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 149 6.1 General Punctuation Punctuation

The main difference among other space characters is their width. U+2000..U+2006 are standard quad widths used in typography. U+2007   has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008   is a space defined to be the same width as a period. U+2009   and U+200A   are successively smaller-width spaces used for narrow word gaps and for justi- fication of type. The fixed-width space characters (U+2000..U+200A) are derived from conventional (hot lead) typography. Algorithmic kerning and justification in computerized typography do not use these characters. However, where they are used, as, for example, in typesetting mathematical formulae, their width is generally font-specified, and they typi- cally do not expand during justification. The exception is U+2009  , which sometimes gets adjusted. Space characters with special behavior in word or line breaking are described in “Line and Word Breaking” in Section 13.2, Layout Controls. Space characters may also be found in other character blocks in the Unicode Standard. The list of space characters appears in Table 6-1. Table 6-1. Unicode Space Characters Code Name U+0020  U+00A0 -  U+2000   U+2001   U+2002   U+2003   U+2004 --  U+2005 --  U+2006 --  U+2007   U+2008   U+2009   U+200A   U+200B    U+202F  -  U+3000  

U+200B   , as well as several spacelike, zero-width characters with special properties, are described in Section 13.2, Layout Controls.

Dashes and Hyphens Because of its prevalence in legacy encodings, U+002D - is the most com- mon of the dash characters used to represent a hyphen. It has ambiguous semantic value and is rendered with an average width. U+2010  represents the hyphen as found in words such as “left-to-right.” It is rendered with a narrow width. When typesetting text, U+2010  is preferred over U+002D -. U+2011 -  is present for compatibility with existing standards. It has the same semantic value as U+2010 , but should not be broken across lines. U+2012   is present for compatibility with existing standards; it has the same (ambiguous) semantic as the U+002D -, but has the same width as digits (if they are monospaced). U+2013   is used to indicate a range of values, such as 1973– 1984. It should be distinguished from the U+2122 , which is an arithmetic operator; however, typographers have typically used U+2013   in typesetting to represent the minus sign. For general compatibility in interpreting formulas, U+002D -,

150 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation

U+2012  , and U+2212   should each be taken as indicating a minus sign, as in “x = a - b.” U+2014   is used to make a break—like this—in the flow of a sentence. It is com- monly represented with a as a double-hyphen. In older mathematical typogra- phy, U+2014   is also used to indicate a binary minus sign. U+2015   is used to introduce quoted text in some typographic styles. Dashes and hyphen characters may also be found in other character blocks in the Unicode Standard. A list of dash and hyphen characters appears in Ta ble 6-2. For a description of the line-breaking behavior of dashes and hyphens, see Unicode Technical Report #14, “Line Breaking Properties,” on the CD-ROM or the up-to-date version on the Unicode Web site. Table 6-2. Unicode Dash Characters Code Name U+002D - U+007E  (= swung dash) U+00AD   U+058A   U+1806     U+2010  U+2011 -  U+2012   U+2013   U+2014   U+2015   (= quotation dash) U+207B   U+208B   U+2212   U+301C   U+3030  

Language-Based Usage of Quotation Marks U+0022   is the most commonly used character for quotation mark. How- ever, it has ambiguous semantics and direction. Word processors commonly offer a facility for automatically converting the U+0022   to a contextually-selected curly quote glyph. Low Quotation Marks. U+201A  -   and U+201E  -   are unambiguously opening quotation marks. All other quotation marks have heterogeneous semantics. They may represent opening or closing quotation marks depending on the usage. European Usage. The use of quotation marks differs systematically by language and by medium. In European typography, it is common to use (single or double angle quotation marks) for books and, except for some languages, curly quotation marks in office automation. Single guillemets can be found for quotes inside quotes. The following description does not attempt to be complete, but intends to document a range of known usages of quotation mark characters. Some of these usages are also illustrated in Figure 6-1. In this section, the words single and double are omitted from character names where there is no conflict or both are meant. Dutch, English, Italian, Portugese, Spanish, and Turkish use a left quotation mark and a right quotation mark for opening and closing quotations, respectively. It is typical to alter- nate single and double quotes for quotes within quotes. Whether single or double quotes are used for the outer quotes depends on local and stylistic conventions.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 151 6.1 General Punctuation Punctuation

Czech, German, and Slovak use the low-9 style of quotation mark for opening instead of the standard open quotes. They employ the left quotation mark style of quotation mark for closing instead of the more common right quotation mark forms. When guillemets are used in German books, they point to the quoted text. This style is the inverse of French usage. Danish, Finnish, Norwegian, and Swedish use the same right quotation mark character for both the opening and closing quotation character. This usage is employed both for office automation purposes and for books. Books sometimes use the , U+00BB -     , for both opening and closing. Hungarian and Polish usage of quotation marks is similar to the Scandinavian usage, except that they use low double quotes for opening quotations. Presumably, these lan- guages avoid the low single quote so as to prevent confusion with the comma. French, Greek, Russian and Slovenian, among others, use the guillemets, but Slovenian usage is the same as German usage in their direction. Of these languages, at least French inserts space between text and quotation marks. In the French case, U+00A0 -  can be used for the space that is enclosed between quotation mark and text; this choice helps line-breaking algorithms. Figure 6-1. European Quotation Marks

Single right quote = apostrophe ‘quote’ don’t

Usage depends on language “English”«French» „German“»Slovenian« ”Swedish”»Swedish books»

East Asian Usage. The glyph for each quotation mark character for an Asian character set occupies predominantly a single quadrant of the character cell. The quadrant used depends on whether the character is opening or closing and whether the glyph is for use with hori- zontal or vertical text. The pairs of quotation characters are listed in Table 6-3. Table 6-3. East Asian Quotation Marks

Style Opening Closing Corner bracket 300C 300D White corner bracket 300E 300F Double prime 301D 301F

Glyph Variation. The glyphs for “double-prime” quotation marks consist of a pair of wedges, slanted either forward or backward, with the tips of the wedges pointing either up or down. In a pair of double-prime quotes, the closing and the opening character of the pair slant in opposite directions. Two common variations exist as shown in Figure 6-2. To

152 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation confuse matters more, another form of double-prime quotation marks is used with West- ern style, horizontal text, in addition to the curly single or double quotes. Figure 6-2. Asian Quotation Marks

Horizontal and vertical glyphs

Glyphs for overloaded character codes “Text” Font style based glyph alternates

Three pairs of quotation marks are used with Western-style, horizontal text, as shown in Table 6-4. Table 6-4. Opening and Closing Forms

Style Opening Closing Comment Single 2018 2019 Rendered as “wide” character Double 201C 201D Rendered as “wide” character Double prime 301D 301E

Overloaded Character Codes. The character codes for standard quotes can refer to regular narrow quotes from a Latin font used with Latin text, as well as wide quotes from an Asian font used with other wide characters. This situation can be handled with some success where the text is marked up with language tags. Consequences for Semantics. The semantics of U+00AB, U+00BB (double guillemets), and U+201D (right double quotation) are context-dependent. The semantics of U+201A and U+201B (low-9 quotation marks) are always opening; this usage is distinct from the usage of U+301E     , which is unambiguously closing.

Apostrophes U+0027  is the most commonly used character for apostrophe. However, it has ambiguous semantics and direction. When text is set, U+2019   -   is preferred as apostrophe. Word processors commonly offer a facility for auto- matically converting the U+0027  to a contextually-selected curly quotation glyph. Letter Apostrophe. U+02BC    is preferred where the apos- trophe is to represent a modifier letter (for example, in to indicate a ). In the latter case, it is also referred to as a letter apostrophe.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 153 6.1 General Punctuation Punctuation

Punctuation Apostrophe. U+2019     is preferred where the character is to represent a punctuation mark, as for contractions: “We’ been here before.” In this latter case, U+2019 is also referred to as a punctuation apostrophe. An implementation cannot assume that users’ text always adheres to the distinction between these characters. The text may come from different sources, including mapping from other character sets that do not make this distinction between the letter apostrophe and the punctuation apostrophe/right single quotation mark. In that case, all of them will generally be represented by U+2019. The semantics of U+2019 are therefore context-dependent. For example, if surrounded by letters or digits on both sides, it behaves as an in-text punctuation character and does not separate words or lines.

Other Punctuation Hyphenation Point. U+2027   is a raised dot used to indicate correct word breaking, as in dic·tion·ar·ies. It is a punctuation mark, to be distinguished from U+00B7  , which has multiple semantics. Fraction Slash. U+2044   is used between digits to form numeric fractions, such as 2/3, 3/9, and so on. The standard form of a fraction built using the fraction slash is defined as follows: Any sequence of one or more decimal digits, followed by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction should be 3 displayed as a unit, such as ¾ or as --- . The precise choice of display can depend upon addi- 4 tional formatting information. If the displaying software is incapable of mapping the fraction to a unit, then it can also be displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 +    + 3 +   + 4 is displayed as 1¾. Spacing Overscore. U+203E  is the above-the-line counterpart to U+005F  . It is a spacing character, not to be confused with U+0305  . As with all over- or underscores, a sequence of these characters should connect in an unbroken line. The overscoring characters also must be distinguished from U+0304  , which does not connect horizontally in this way. Doubled Punctuation. Several doubled punctuation characters that have compatibility decompositions into a sequence of two punctuation marks are also encoded as single char- acters: U+203C   , U+2048   , and U+2049   . These doubled punctuation marks are included as an implementation convenience for East Asian and Mongolian text, which is rendered ver- tically. Bullets. U+2022  is the typical character for a . Within the general punctua- tion, several alternative forms for bullets are separately encoded: U+2023  , U+204C   , and so on. U+00B7   also often functions as a small bullet. Bullets mark the head of specially formatted paragraphs, often occurring in lists, and may in fact use arbitrary graphics or dingbat forms as well as more conventional bullet forms. U+261E    , for example, is often used to highlight a note in text, as a kind of gaudy bullet. Paragraph Marks. U+00A7   and U+00B6   are often used as visible indications of sections or paragraphs of text, in editorial markup, to show format modes, and so on. Which character indicates sections and which character indicates

154 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation paragraphs may vary by convention. U+204B    is a fairly common alternate representation of the paragraph mark.

CJK Symbols and Punctuation: U+3000–U+303F This block encodes punctuation marks and symbols used primarily by writing systems that employ Han ideographs. Most of these characters are found in East Asian standards. U+3000   is provided for compatibility with legacy character sets. It is a fixed-width space appropriate for use with an ideographic font. U+301C   and U+3030   are special forms of dashes found in East Asian character standards. (For a list of other space and dash characters in the Unicode Standard, see Table 6-1 and Table 6-2.) U+3037       is a visible indicator of the line feed separator symbol used in the Chinese telegraphic code; it is comparable to the pictures of control codes found in the Control Pictures block. U+3005    is used to stand for the second of a pair of identical ideographs occurring in adjacent positions within a document. U+3006    is used frequently on signs to indicate that a store or booth is closed for business. The Japanese pronunciation is shime, most often encountered in the compound shime-kiri. U+3012   is used in Japanese addresses immediately preceding the numerical postal code. It is also used on forms and applications to indicate the blank space in which a postal code is to be entered. U+3020    and U+3016    are properly glyphic variants of U+3012, and are included for compatibility. U+3031     and U+3032         are used only in vertically written Japanese to repeat pairs of kana characters occurring immediately prior in a document. The voiced variety U+3032 is used in cases where the repeated kana are to be voiced. For instance, a repetitive phrase like toki- doki could be expressed in vertical writing. Both of these characters are intended to be represented by “double height” glyphs requiring two ideo- graphic “cells” to print; this intention also explains the existence in source standards of the characters representing the top and bottom halves of these (that is, the characters U+3033, U+3034, and U+3035). In horizontal writing, similar characters are used, and they are sep- arately encoded. In Hiragana, the equivalent repeat marks are encoded at U+309D and U+309E; in Katakana, they are U+30FD and U+30FE.

Unknown or Unavailable Ideographs U+3013   is used to indicate the presence of, or to hold a place for, an ideograph that is not available when a document is printed. It has no other use. Its name comes from its resemblance to the mark left by traditional Japanese sandals (geta); a variety of light and heavy glyphic variants occur. U+303E    is a graphic character that is to be rendered visibly. It alerts the user that the intended character is similar to, but not equal to, the char- acter that follows. Its use is similar to the existing character U+3013  . A   substitutes for the unknown or unavailable character, but does not identify it. The    is the head of a two-character sequence that gives some indication about the intended glyph or intended character. Ultimately, the -    and the character following it are intended to be replaced

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 155 6.1 General Punctuation Punctuation by the correct character, once it has been identified or a font resource or input resource has been provided for it. U+303F  -  is a visible indicator of a display cell filler used when ideographic characters have been split during display on systems using a double-byte character encoding. It is included in the Unicode Standard for compatibility. See also “Ideographic Description Sequences” in Section 10.1, Han.

CJK Compatibility Forms: U+FE30–U+FE4F A number of presentation forms encoded in this block are found in the Republic of China (Taiwan) national standard CNS 11643. These vertical forms of punctuation characters are provided for compatibility with those legacy implementations that encode these characters explicitly when Chinese text is being set in vertical rather than horizontal lines. The pre- ferred Unicode encoding is to encode the nominal characters that correspond to these ver- tical variants. Then, at display time, the appropriate glyph is selected according to the line orientation.

Small Form Variants: U+FE50–U+FE6F The Republic of China (Taiwan) national standard CNS 11643 also encodes a number of small variants of ASCII punctuation. The characters of this block, while construed as fullwidth characters, are nevertheless depicted using small forms that are set in a fullwidth display cell. (See the discussion in Section 10.3, Katakana.) These characters are provided for compatibility with legacy imple- mentations. Unifications. Two small form variants from CNS 11643/plane 1 were unified with other   characters outside the ASCII block: 213116 was unified with U+00B7 , and   226116 was unified with U+2215 .

156 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 7 European Alphabetic Scripts 7

Modern European alphabetic scripts are derived from or influenced by the Greek script. The word alphabet is derived from the Greek word alphabetos, itself derived from the names of the first two Greek letters, alpha and . The Greek script itself is an adaptation of the . A Greek innovation was writing the letters from left to right, which is the writing direction for all the scripts derived from or inspired by Greek. The European alphabetic scripts in the Unicode Standard, •Latin • Greek • Cyrillic •Armenian •Georgian •Ogham •Runic are written from left to right. Many have separate lowercase and uppercase forms of the alphabet. Spaces are used to separate words. Accents and diacritical marks are used to indi- cate phonetic features and to extend the use of base scripts to additional languages. The process of applying diacritical marks is potentially open-ended—one of the reasons com- bining marks are encoded in the Unicode Standard. Latin and Cyrillic are used to write or transliterate texts in a wide variety of languages. The is derived from the alphabet used by the Etruscans, who had adopted a west- ern variant of the classical . Originally it contained only 24 capital letters. The modern Latin alphabet as it is found in the Basic Latin block owes its appearance to innovations of scribes during the and practices of the early print- ers. The was developed in the ninth century and is also based on Greek. The Georgian and Armenian scripts were invented in the fifth century and are influenced by Greek. Modern Georgian does not have separate upper- and lowercase forms. The International Phonetic Alphabet is an extension of the Latin alphabet, enabling it to represent the phonetics of all languages. The two historic scripts of northwestern Europe, Runic and Ogham, have a distinct appear- ance owing to their primary use in carving inscriptions in stone and wood. They are con- ventionally rendered left to right in scholarly literature, but on the original stone carvings often proceeded in an arch tracing the outline of the stone.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 159 7.1 Latin European Alphabetic Scripts

7.1 Latin The Latin script was derived from the Greek script. Today it is used to write a wide variety of languages all over the world. In the process of adapting it to other languages, numerous extensions have been devised. The most common is the addition of diacritical marks. Fur- thermore, the creation of digraphs, inverse or reverse forms, and outright new characters have all been used to extend the Latin script. The Latin script is written in linear sequence from left to right. Spaces are used to separate words and provide the primary line-breaking opportunities. Hyphens are used where lines are broken in the middle of a word. (For more information, see Unicode Technical Report #14, “Line Breaking Properties,” on the CD-ROM or the up-to-date version on the Uni- code Web site.) Latin letters come in upper- and lowercase pairs. Diacritical Marks. Speakers of different languages treat the addition of a diacritical mark to a base letter differently. In some languages, the combination is treated as a letter in the alphabet for the language. In others, such as English, the same words can often be spelled with and without the diacritical mark without implying any difference. Most languages that use the Latin script treat letters with diacritical marks as variations of the base letter, but do not accord the combination the full status of an independent letter in the alphabet. The encoding for the Latin script in the Unicode standard is sufficiently flexible to allow implementations to support these letters according to the users’ expectation, as long as the language is known. Widely used accented character combinations are provided as single characters to accommodate interoperation with pervasive practice in legacy encodings. Combining diacritical marks can express these and all other accented letters as combining character sequences. In the Unicode Standard, all diacritical marks are encoded in sequence after the base char- acters to which they apply. For more details, see subsection on “Combining Diacritical Marks” in Section 7.9, Combining Marks, and also Section 2.6, Combining Characters. Standards. Unicode follows ISO 8859-1 in the layout of Latin letters up to U+00FF. ISO 8859-1, in turn, is based on older standards, among others ASCII (ANSI X3.4), which is identical to ISO/IEC 646:1991-IRV. Like ASCII, ISO 8859-1 contains Latin letters, punctu- ation signs, and mathematical symbols. The use of the additional characters is not restricted to the context of Latin script usage. The description of these characters is found in Chapter 6, Punctuation. Related Characters. For other Latin or Latin-derived characters, see letterlike symbols (U+2100..U+214F), currency symbols (U+20A0..U+20CF), (U+2600..U+26FF), enclosed alphanumberics (U+2460..U+24FF), and fullwidth forms (U+FF21..U+FF5A).

Letters of Basic Latin: U+0041–U+007A Only a small fraction of the languages written with the Latin script can be written entirely with the basic set of 26 uppercase and 26 lowercase Latin letters contained in this block. The 26 basic letter pairs form the core of the alphabets used by all the other languages that use the Latin script. A stream of text using one of these alphabets would therefore intermix characters from the Basic Latin block and other Latin blocks. Occasionally a few of the basic letter pairs are omitted, such as in Italian, which does not use “j” or “w”.

160 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.1 Latin

Alternative Graphics. Common typographical variations include the open- and closed- loop form of the lowercase letters “a” and “g”. systems, such as IPA and , make a distinction between such forms.

Letters of the Latin-1 Supplement: U+00C0–U+00FF The Latin-1 supplement extends the basic 26 letter pairs of ASCII by providing additional letters for major (listed in the next paragraph). Like ASCII, the Latin- 1 set also includes a miscellaneous set of punctuation and mathematical signs. Punctua- tion, signs, and symbols not included in the Basic Latin and Latin-1 Supplement blocks are encoded in character blocks starting with the General Punctuation block. Languages. The languages supported by the Latin-1 supplement include Danish, Dutch, Faroese, Finnish, Flemish, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Span- ish, and Swedish. Ordinals. U+00AA    and U+00BA    can be depicted with an underscore, but many modern fonts show them as superscripted Latin letters with no underscore. In sorting and searching, these characters should be treated as weakly equivalent to their Latin character equivalents. Spacing Clones of Diacritics. ISO 8859-1 contains eight characters that are ambiguous regarding whether they denote combining characters or separate spacing characters. In the Unicode Standard, the corresponding codepoints (U+005E ^  , U+005F _  , U+0060 Å  , U+007E ~ , U+00A8 ¨ , U+00AF ¯ , U+00B4 ´  , and U+00B8 ¸ ) are restricted to use as spacing characters. The Unicode Standard provides unambiguous combining characters in the character block for Combining Diacritical Marks, which can be used to represent accented Latin letters by means of composed character sequences. U+00B0 °   is also occasionally used ambiguously by implementations of ISO 8859-1 to denote a spac- ing form of a diacritic ring above a letter; in the Unicode Standard, that spacing diacritical mark is denoted unambiguously by U+02DA °  . U+007E “~”  is ambigu- ous between usage as a spacing form of a diacritic and as an operator or other punctuation; it is generally rendered with a center line glyph, rather than as a diacritic raised tilde. The spacing form of the diacritic tilde is denoted unambiguously by U+02DC “Á”  .

Latin Extended-A: U+0100–U+017F The Latin Extended-A block contains a collection of letters that, when added to the letters contained in the Basic Latin and Latin-1 Supplement blocks, allow for the representation of most European languages that employ the Latin script. Many other languages can also be written with the characters in this block. Most of these characters are equivalent to precom- posed combinations of base character forms and combining diacritical marks. These com- binations may also be represented by means of composed character sequences. See Section 2.6, Combining Characters. Standards. This block includes characters contained in International Standard ISO 8859— Part 2. Latin alphabet No. 2, Part 3. Latin alphabet No. 3, Part 4. Latin alphabet No. 4, and Part 9. Latin alphabet No. 5. Many of the other graphic characters contained in these stan- dards, such as punctuation, signs, symbols, and diacritical marks, are already encoded in the Latin-1 Supplement block. Other characters from these parts of ISO 8859 are encoded in other blocks, primarily in the Spacing Modifier Letters block (U+02B0..U+02FF) and in the character blocks starting at and following the General Punctuation block.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 161 7.1 Latin European Alphabetic Scripts

Languages. Most languages supported by this block also require the concurrent use of characters contained in the Basic Latin and Latin-1 Supplement blocks. When combined with these two blocks, the Latin Extended-A block supports Afrikaans, Basque, Breton, Catalan, Croatian, Czech, , Estonian, French, Frisian, Greenlandic, Hungarian, Latin, Latvian, Lithuanian, Maltese, Polish, Provençal, Rhaeto-Romanic, Romanian, Romany, Sami, Slovak, Slovenian, Sorbian, Turkish, Welsh, and many others. Alternative Glyphs. Some characters have alternative representations, although they have a common semantic. In such cases, a preferred glyph is chosen to represent the character in the code charts, even though it may not be the form used under all circumstances. Some examples to illustrate this point are provided in Figure 7-1 and discussed in the text that follows. Figure 7-1. Alternative Glyphs @AU BCD, EF LR ST MN VW XYZ

When Czech is printed in books, U+010F       and U+0165       letter forms with apostrophe are often used instead of letter forms with caron (hacek) over the base forms. In Slovak, this use also applies to U+013E      . The use of an apostrophe can avoid some line crashes over the ascenders of those letters and so result in better typography. In typewritten or handwritten documents, or in didactic and pedagogical material, on the other hand, let- ter forms with haceks are preferred. Languages other than Czech or Slovak that make use of these characters may simply choose to always use the forms with haceks. A similar situation can be seen in the Latvian letter U+0123      . In good Latvian typography, this character is always shown with a rotated comma over the g, rather than a cedilla below the g, because of the typographical design and layout issues resulting from trying to place a cedilla below the descender loop of the g. Poor Latvian fonts may substitute an acute accent for the rotated comma, and handwritten or other printed forms may actually show the cedilla below the g. The uppercase form of the letter is always shown with a cedilla, as the rounded bottom of the G poses no problems for attachment of the cedilla. Other Latvian letters with cedilla below (U+0137      , U+0146      , and U+0157      ) always prefer a glyph with a floating comma below as there is no proper attach- ment point for a cedilla proper at the bottom of the base form. In Turkish and Romanian, a cedilla and a comma below can replace one another depending on the font style. The letters U+015F       and U+0163       (and their uppercase counterparts) have been dupli- cated at U+0219        and U+021B       . The duplicated characters with explicit commas below are provided solely for compatibility with sociopolitical practices. Legacy encodings for these

162 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.1 Latin characters, including ISO/IEC 8859-2, contain only a single form of each of these charac- ters, which is mapped to the form with cedilla. In general, characters with or below are subject to variable typographical usage, depending on the availability and quality of fonts used, the technology, and the geo- graphic area. Various hooks, commas, and squiggles may be substituted for the nominal forms of these diacritics below, and even the direction of the hooks may be reversed. Imple- menters should take care to become familiar with particular typographical traditions before assuming that characters are missing or are wrongly represented in the code charts in the Unicode Standard. Exceptional Case Pairs. The characters U+0130        and U+0131      (used primarily in Turkish) are assumed to take ASCII “i” and “I” as their case alternates, respectively. This mapping makes the corre- sponding reverse mapping language-specific; mapping in both directions requires special attention from the implementor (see Section 5.17, Sorting and Searching). See SpecialCas- ing.txt on the CD-ROM for more information. Diacritics on i and j. A dotted (normal) i or j followed by a top nonspacing mark loses the dot in rendering. Thus, in the word naïve, the ï could be spelled with i + diaeresis. Just as Cyrillic A is not equivalent to Latin A, a dotted-i is not equivalent to a Turkish dotless-i + overdot, nor are other cases of accented dotted-i equivalent to accented dotless-i (for exam- ple, i + ¨Å – + ¨). The same pattern is used for j. To express the forms sometimes used in the Baltic (where the dot is retained under a top accent), use i + overdot + accent (see Figure 7-2). Figure 7-2. Diacritics on i and j

i + Y ➠ • i + \ + Z ➠ W j + [ ➠ X i + Z + \ ➠ V

Latin Extended-B: U+0180–U+024F The Latin Extended-B block contains letter forms used to extend Latin scripts to represent additional languages. It also contains phonetic symbols not included in the International Phonetic Alphabet (see the IPA Extensions block, U+0250..U+02AF). Standards. This block covers, among others, characters in ISO 6438 Documentation— African coded character set for bibliographic information interchange, Pinyin Latin tran- scription characters from the People’s Republic of China national standard GB 2312 and from the Japanese national standard JIS X 0212, and Sami characters from ISO 8859 Part 10. Latin alphabet No. 6. Arrangement. The characters are arranged in a nominal , followed by a small collection of Latinate forms. Upper- and lowercase pairs are placed together where possible, but in many instances the other case form is encoded at some distant location and so is cross-referenced. Variations on the same base letter are arranged in the following order: turned, inverted, hook attachment, stroke extension or modification, different style (script), small cap, modified basic form, ligature, and Greek-derived. Croatian Digraphs Matching Serbian Cyrillic Letters. Serbo-Croatian is a single language with paired alphabets: a Latin script (Croatian) and a Cyrillic script (Serbian). A set of

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 163 7.1 Latin European Alphabetic Scripts compatibility digraph codes is provided for one-to-one . There are two potential uppercase forms for each digraph, depending on whether only the initial letter is to be capitalized (titlecase), or both (all uppercase). The Unicode Standard offers both forms so that software can convert one form to the other without changing font sets. The appropriate cross-references are given for the lowercase letters. For more information about canonical equivalence, see Chapter 3, Conformance. Pinyin Diacritic-Vowel Combinations. The Chinese standard GB 2312, as well as the Japa- nese standard JIS X 0212, includes a set of codes for Pinyin, used for Latin transcription of . Most of the letters used in Pinyin (even those with com- bining diacritical marks) are already covered in the preceding Latin blocks. The group of 16 characters provided here completes the Pinyin character set specified in GB 2312 and JIS X 0212. Case Pairs. A number of characters in this block are uppercase forms of characters whose lowercase form is part of some other grouping. Many of these characters came from the International Phonetic Alphabet; they acquired novel uppercase forms when they were adopted into Latin script-based writing systems. Occasionally, however, alternative upper- case forms arose in this process. In some instances, research has shown that alternative uppercase forms are merely variants of the same character. If so, such variants are assigned a single Unicode value, as is the case of U+01B7    . But when research has shown that two uppercase forms are actually used in different ways, then they are given different codes; such is the case for U+018E      and U+018F    . In this instance, the shared lowercase form is copied to enable unique case-pair mappings if desired: U+01DD      is a copy of U+0259    . For historical reasons, the names of some case pairs differ. For example, U+018E      is the uppercase of U+01DD     —not of U+0258     . (For default case mappings of Unicode characters, see Chapter 4, Character Properties.) Languages. Some indication of language or other usage is given for most characters within the names lists accompanying the character charts.

IPA Extensions: U+0250–U+02AF The IPA Extensions block contains primarily the unique symbols of the International Pho- netic Alphabet (IPA), which is a standard system for indicating specific speech sounds. The IPA was first introduced in 1886 and has undergone occasional revisions of content and usage since that time. The Unicode Standard covers all single symbols and all diacritics in the last published IPA revision (1989), as well as a few symbols in former IPA usage that are no longer currently sanctioned. A few symbols have been added to this block that are part of the transcriptional practices of Sinologists, Americanists, and other linguists. Some of these practices have usages independent of the IPA and may use characters from other Latin blocks rather than IPA forms. Note also that a few nonstandard or obsolete phonetic symbols are encoded in the Latin Extended-B block. An essential feature of IPA is the use of combining diacritical marks. IPA diacritical mark characters are coded in the Combining Diacritical Marks block, U+0300..U+036F. In IPA, diacritical marks can be freely applied to base form letters to indicate fine degrees of pho- netic differentiation required for precise recording of different languages. Standards. The characters in this block are taken from the 1989 revision of the Interna- tional Phonetic Alphabet, published by the International Phonetic Association. The Inter- national Phonetic Association standard considers IPA to be a separate alphabet, so it

164 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.1 Latin includes the entire Latin lowercase alphabet a–z, a number of extended Latin letters such as U+0153     œ, and a few Greek letters and other symbols as sepa- rate and distinct characters. In contrast, the Unicode Standard does not duplicate either the Latin lowercase letters a–z or other Latin or Greek letters in encoding IPA. Note that unlike other character standards referenced by the Unicode Standard, IPA constitutes an extended alphabet and phonetic transcriptional standard, rather than a character encoding standard. Unifications. The IPA symbols are unified as much as possible with other letters, albeit not with nonletter symbols like U+222B  . The IPA symbols have also been adopted into the Latin-based alphabets of many written languages, such as some used in Africa. It is futile to attempt to distinguish a transcription from an actual alphabet in such cases. Therefore, many IPA symbols are found outside the IPA Extensions block. IPA symbols that are not found in the IPA Extensions block are listed as cross-references at the begin- ning of the character names list for this block. IPA Alternates. In a few cases IPA practice has, over time, produced alternate forms, such as U+0269     “ι” versus U+026A      “.” The Unicode Standard provides separate encodings for the two alternate forms because they are used in a meaningfully distinct fashion. Case Pairs. IPA does not sanction case distinctions; in effect, its phonetic symbols are all lowercase. When IPA symbols are adopted into a particular alphabet and used by a given written language (as has occurred, for example, in Africa) they acquire uppercase forms. Because these uppercase forms are not themselves IPA symbols, they are generally encoded in the Latin Extended-B block (or other Latin extension blocks) and are cross-referenced with the IPA names list. Typographic Variants. IPA includes typographic variants of certain Latin and Greek letters that would ordinarily be considered variations of font style rather than of character iden- tity, such as   letter forms. Examples include a typographic variant of the Greek letter φ, as well as the borrowed letter Greek iota ι, which has a unique Latin uppercase form. These forms are encoded as separate characters in the Unicode Standard because they have distinct semantics in plain text. Digraph Ligatures. IPA officially sanctions six digraph ligatures used in tran- scription of coronal . These are encoded at U+02A3..U+02A8. The IPA digraph ligatures are explicitly defined in IPA and also have possible semantic values that make them not simply rendering forms. Thus, for example, while U+02A6      is a transcription for the sounds that could also be transcribed in IPA as “ts” U+0074 U+0073, the choice of the digraph ligature may be the result of a deliberate dis- tinction made by the transcriber regarding the systematic phonetic status of the affricate. The choice of whether to ligate cannot be left to rendering software based on the font avail- able. This ligature also differs in typographical design from the ts ligature found in some old-style fonts. Encoding Structure. The IPA Extensions block is arranged in approximate alphabetical order according to the Latin letter that is graphically most similar to each symbol. This order has nothing to do with a phonetic arrangement of the IPA letters.

Latin Extended Additional: U+1E00–U+1EFF The characters in this block constitute a number of precomposed combinations of Latin letters with one or more general diacritical marks. Each of the characters contained in this block may be alternatively represented with a base letter followed by one or more general diacritical mark characters found in the Combining Diacritical Marks block. A canonical form for such alternative representations is specified in Chapter 3, Conformance.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 165 7.1 Latin European Alphabetic Scripts

Vietnamese Vowel Plus Tone Mark Combinations. A portion of this block (U+1EA0.. U+1EF9) comprises vowel letters of the modern (qu¶c ngÔ) com- bined with a diacritic mark that denotes the phonemic tone that applies to the syllable. In the modern Vietnamese alphabet, there are 12 vowel letters and five tone marks (see Figure 7-3). Figure 7-3. Vietnamese Letters and Tone Marks a ì â e ê i o ô † u • y þ€þþôþìþˆ

Some implementations of Vietnamese systems prefer storing the combination of vowel let- ter and tone mark as a singly encoded element; other implementations prefer storing the vowel letter and tone mark separately. The former implementations will use characters defined in this block along with combination forms defined in the Latin-1 Supplement and Latin Extended-A character blocks; the latter implementations will use the basic vowel let- ters in the Basic Latin, Latin-1 Supplement, and Latin Extended-A blocks along with char- acters from the Combining Diacritical Marks block. For these latter implementations, the characters U+0300  , U+0309 c  , U+0303 -  , U+0301  , and U+0323 c   should be used in representing the Vietnamese tone marks. The characters U+0340     and U+0341     are deprecated and should not be used.

Latin Ligatures: FB00–FB06 This section of the Alphabetic Presentation forms block (U+FB00..U+FB4F) contains sev- eral common Latin ligatures, which occur in legacy encodings. By design, the Unicode standard does not provide a general mechanism to indicate where ligatures should be dis- played. Where to place a Latin ligature is a matter of typographical style as well as the orthographical rules of the language. Some languages prohibit ligatures across word boundaries. In these cases, it is preferable for the implementations to use unligated charac- ters in the backing store and provide out-of-band information to the display layer where ligatures may be placed.

166 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.2 Greek

7.2 Greek

Greek: U+0370–U+03FF The Greek script is used for writing the and (in an extended variant) the Coptic language. The Greek script had a strong influence in the development of the Latin and Cyrillic scripts. The Greek script is written in linear sequence from left to right with the occasional use of nonspacing marks. Greek letters come in upper- and lowercase pairs. Standards. The Unicode encoding of Greek is based on ISO 8859-7, which is equivalent to the Greek national standard ELOT 928. The Unicode Standard encodes Greek characters in the same relative positions as in ISO 8859-7. A number of variant and archaic characters are taken from the bibliographic standard ISO 5428. Polytonic Greek. Polytonic Greek, used for (classical and Byzantine), may be encoded using either combining character sequences or precomposed base plus diacritic combinations. For the latter, see the following subsection, “Greek Extended: U+1F00– U+1FFF.” Nonspacing Marks. Several nonspacing marks commonly used with the Greek script are found in the Combining Diacritical Marks range (see Table 7-1). Table 7-1. Nonspacing Marks Used with Greek

Code Name Alternative Names U+0300    varia U+0301    tonos, oxia U+0302    U+0303   U+0304   U+0306   U+0308   dialytika U+0313    psili U+0314     dasia U+0342    circumflex, tilde U+0343    comma above U+0345   

Because the characters in the Combining Diacritical Marks block are encoded by shape, not by meaning, they are appropriate for use in Greek where applicable. However, the character U+0344     is deprecated and should not be used. The combination of dialytika plus tonos is instead represented by the sequence U+0308 -   plus U+0301  . Multiple nonspacing marks applied to the same baseform character are encoded in inside- out sequence. See the general rules for applying nonspacing marks in Section 2.6, Combin- ing Characters. The basic Greek accent written in is called tonos. It is represented by an acute accent (U+0301). The shape that the acute accent takes over Greek letters is generally steeper than that shown over Latin letters in Western European typographic traditions, and

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 167 7.2 Greek European Alphabetic Scripts in earlier editions of this standard was mistakenly shown as a vertical line over the vowel. Polytonic Greek has several contrastive accents, and the accent, or tonos, written with an acute accent is referred to as oxia, in contrast to the varia, which is written with a grave accent. U+0342    may appear as either a circumflex or a tilde: 1 versus 2. Because of this variation in form, the perispomeni was encoded distinctly from U+0303  . U+0313    and U+0343    both take the form of a raised comma over a baseform letter. U+0343    was included for compatibility reasons; U+0313    is the preferred form for general use. The nonspacing mark ypogegrammeni (also known as iota subscript in English) can be applied to the vowels alpha, , and to represent historic . This mark appears as a small iota below the vowel. When applied to a single uppercase vowel, the iota does not appear as a subscript, but is instead normally rendered as a regular lowercase iota to the right of the uppercase vowel. This form of the iota is called prosgegrammeni (also known as iota adscript in English). In completely uppercased words, the iota subscript should be replaced by a capital iota. See SpecialCasing.txt on the CD-ROM. Archaic repre- sentations of Greek words (which did not have lowercase or accents) use the Greek capital letter iota following the vowel for these diphthongs. Such archaic representations require special case mapping. Variant . U+03A5     has two common forms— one looks essentially like the Latin capital Y, and the other has two symmetric upper branches that curl like rams’ horns, “:”. The Y-form glyph has been chosen consistently for use in the code charts, both for monotonic and polytonic Greek. For mathematical usage, the rams’ horn form of the glyph is required. Variant forms of certain other Greek letters are encoded as separate characters in ISO/IEC 8859-7 and ISO 5428; therefore, those forms are also included as separate characters in the Unicode Standard. They include U+03C2      and U+03D0   , as well as additional forms of the capital letter that have an asymmetric hook—for example, U+03D2     . Greek Letters as Symbols. For compatibility purposes, a few Greek letters are separately encoded as symbols in other character blocks. Examples include U+00B5   µ in the Latin-1 Supplement character block and U+2126   Ω in the Letterlike Symbols character block. The use of Greek letters for mathematical variables and operators is well established. Characters from the Greek block may be used for these symbols. Punctuation-like Characters. The question of which punctuation-like characters are uniquely Greek and which ones can be unified with generic Western punctuation has no definitive answer. The Greek U+037E    erotimatiko “;” is encoded for compatibility. The preferred character is U+003B . Historic Letters. Historic Greek letters have been retained from ISO 5428. Coptic-Unique Letters. The Coptic script is regarded primarily as a stylistic variant of the Greek alphabet. The letters unique to Coptic are encoded in a separate range at the end of the Greek character block. Those characters may be used together with the basic Greek characters to represent the complete . Coptic text may be rendered using a font that contains the Coptic style of depicting the characters it shares with the Greek alphabet. Texts that mix languages together must employ appropriate font style associations.

168 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.2 Greek

Related Characters. For math symbols, see mathematical operators (U+2200..U+22FF). For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu- ation (U+0000..U+007F).

Greek Extended: U+1F00–U+1FFF The characters in this block constitute a number of precomposed combinations of Greek letters with one or more general diacritical marks; in addition, a number of spacing forms of Greek diacritical marks are provided here. In particular, these characters facilitate the representation of Polytonic Greek texts without the use of combining marks. Each of the characters contained in this block may be alternatively represented with a base letter from the Greek block followed by one or more general diacritical mark characters found in the Combining Diacritical Marks block. A canonical form for such alternative representations is specified in Chapter 3, Conformance. Spacing Diacritics. Sixteen additional spacing diacritic marks are provided in this charac- ter block for use in the representation of Polytonic Greek texts. Each has an alternative rep- resentation for use with systems that support nonspacing marks. The Unicode Standard considers the nonspacing alternative forms to be the canonical Unicode representation of the information represented by the spacing forms. The nonspacing alternatives appear in Table 7-2. Table 7-2. Greek Spacing and Nonspacing Pairs

Spacing Form Nonspacing Form 1FBD GREEK KORONIS 0313 COMBINING COMMA ABOVE 037A GREEK YPOGEGRAMMENI 0345 COMBINING YPOGEGRAMMENI 1FBF GREEK PSILI 0313 COMBINING COMMA ABOVE 1FC0 GREEK PERISPOMENI 0342 COMBINING GREEK PERISPOMENI 1FC1 GREEK DIALYTIKA AND 0308 COMBINING DIAERESIS PERISPOMENI + 0342 COMBINING GREEK PERISPOMENI 1FCD GREEK PSILI AND VARIA 0313 COMBINING COMMA ABOVE + 0300 COMBINING GRAVE ACCENT 1FCE GREEK PSILI AND OXIA 0313 COMBINING COMMA ABOVE + 0301 COMBINING ACUTE ACCENT 1FCF GREEK PSILI AND PERISPOMENI 0313 COMBINING COMMA ABOVE + 0342 COMBINING GREEK PERISPOMENI 1FDD GREEK DASIA AND VARIA 0314 COMBINING REVERSED COMMA ABOVE + 0300 COMBINING GRAVE ACCENT 1FDE GREEK DASIA AND OXIA 0314 COMBINING REVERSED COMMA ABOVE + 0301 COMBINING ACUTE ACCENT 1FDF GREEK DASIA AND PERISPOMENI 0314 COMBINING REVERSED COMMA ABOVE + 0342 COMBINING GREEK PERISPOMENI 1FED GREEK DIALYTIKA AND VARIA 0308 COMBINING DIAERESIS + 0300 COMBINING GRAVE ACCENT 1FEE GREEK DIALYTIKA AND OXIA 0308 COMBINING DIAERESIS + 0301 COMBINING ACUTE ACCENT 1FEF GREEK VARIA 0300 COMBINING GRAVE ACCENT 1FFD GREEK OXIA 0301 COMBINING ACUTE ACCENT 1FFE GREEK DASIA 0314 COMBINING REVERSED COMMA ABOVE

Decomposition of Spacing Forms. When decomposing the spacing forms, the spacing sta- tus of the implied usage must be taken into account. Unless information is present to the

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 169 7.2 Greek European Alphabetic Scripts contrary, these spacing forms would be decomposed to U+0020  followed by the nonspacing form equivalents shown in Table 7-2. In archaic forms of Greek, U+0345    and the precom- posed forms that contain it have special case mappings.

170 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.3 Cyrillic

7.3 Cyrillic

Cyrillic: U+0400–U+04FF The Cyrillic script is a member of the family of scripts strongly influenced by the Greek script. Cyrillic has traditionally been used for writing various , among which Russian is predominant. In the nineteenth and early twentieth centuries, Cyrillic was extended to write the non-Slavic minority languages of the former Soviet . The Cyrillic script is written in linear sequence from left to right with the occasional use of non- spacing marks. Cyrillic letters come in upper- and lowercase pairs. Standards. The Cyrillic block of the Unicode Standard is based on ISO 8859-5. The Uni- code Standard encodes Cyrillic characters in the same relative positions as in ISO 8859-5. Unifications. Latin characters included in those alphabets that use both Latin and Cyrillic letters are not given duplicate Cyrillic encodings. Examples include and w for Kurdish. Historic Letters. The historic form of the Cyrillic alphabet is treated as a font style varia- tion of modern Cyrillic because the historic forms are relatively close to the modern appearance and because some of them are still in modern use in languages other than Rus- sian (for example, U+0406     “I”is used in modern Ukrainian and Byelorussian). Since the historic Cyrillic characters encoded in Unicode (U+0460.. U+0486) rarely occur in modern form, these letters are shown in the charts in an archaic font. A complete Old Cyrillic set would be obtained by rendering the whole Cyrillic section (that is, U+0400..U+0486) in that same style. Extended Cyrillic. These letters are used in alphabets for minority languages of the former . The scripts of some of these languages have often been revised in the past; the Unicode Standard includes only the alphabets in current use, not the rejected old letter- forms. Glagolitic. The history of the creation of the Slavic scripts and their relationship has been lost. The Unicode Standard regards Glagolitic as a separate script from Cyrillic, not as a font change from Cyrillic. This position is taken primarily because Glagolitic appears unrecog- nizably different from Cyrillic, and secondarily because Glagolitic has not grown to match the expansion of Cyrillic. The is not currently supported by the Unicode Standard.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 171 7.4 Armenian European Alphabetic Scripts

7.4 Armenian

Armenian: U+0530–U+058F The Armenian script is used primarily for writing the Armenian language. The script is written from left to right and generally does not use diacritics (except for the modifier let- ters specified below). It does have upper- and lowercase pairs. Modifier Letters. In modern Armenian typography, the small marks in the group called Armenian modifier letters are placed above and to the right of other letters so that they occupy a letter position of their own. For example, the mark, the exclamation mark, and the interrogation mark are positioned to the right of the vowel in the syllable that is stressed. As the rendering of these modifiers may involve horizontal and vertical adjustments, it should ideally be done with enhanced kerning as described in Section 5.15, Locating Text Element Boundaries. Because these modifiers generally have their own width, they are treated as spacing letters in the Unicode Standard, rather than as nonspacing marks. There appears to be no recorded usage of U+0559       in Armenian text; its presence in this block is thus probably spurious. Punctuation Letters. The Armenian writing system uses many punctuation letters from other blocks, such as U+002C  and U+00B7  . However, when used in Armenian, such punctuation letters should be rendered in a style compatible with the Armenian font style. In addition to U+055D  , which is already included in the modifier letters, two punctuation letters are specific to Armenian: U+0589 -    and U+058A  . The behavior of the latter character is similar to U+00AD  . It is used to indicate a line-breaking opportunity within a polysyllabic Armenian word. Its shape distinguishes it from the soft hyphen. Ligatures. Five Armenian ligatures are encoded in the Alphabetic Presentation Forms block in the range U+FB13..U+FB17. By design, the Unicode Standard does not provide a general mechanism to indicate where ligatures should be displayed. For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu- ation (U+0000..U+007F).

172 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.5 Georgian

7.5 Georgian

Georgian: U+10A0–U+10FF The Georgian script is used primarily for writing the . Upper- and low- ercase pairs exist primarily in archaic forms of the script. Script Forms. The modern Georgian script is a lowercase style called mkhedruli (soldier’s). It originated as the secular derivative of a form called khutsuri (ecclesiastical) that had both uppercase and lowercase pairs. Although no longer used in most modern texts, the khutsuri style is still used for liturgical purposes; the Unicode Standard encodes the uppercase form of khutsuri as well as the lowercase letters of modern Georgian. Georgian Paragraph Separator. The Georgian paragraph separator has a distinct repre- sentation, so it has been separately encoded at U+10FB. It visually marks a paragraph end, but it must be followed by a formatting character such as U+2029   to cause a paragraph break. See the discussion of the paragraph separator in Section 13.2, Layout Controls. Other Punctuation. For the Georgian full stop, use U+0589   . For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu- ation (U+0000..U+007F).

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 173 7.6 Runic European Alphabetic Scripts

7.6 Runic

Runic: U+16A0–U+16F0 The Runic script was historically used to write the languages of the early and medieval soci- eties in the German, Scandinavian, and Anglo-Saxon areas. Use of the Runic script in vari- ous forms covers a period from the first century to the nineteenth century. Some 6,000 Runic inscriptions are known. They form an indispensable source of information about the development of the Germanic languages. Historical Script. The Runic script is one of the first “historical” or “extinct” scripts to be incorporated into the Unicode Standard. The only important use of today is in schol- arly and popular works about the old Runic inscriptions and their interpretation. The Runic script illustrates many technical problems that are typical for this kind of script. Unlike other scripts in the Unicode Standard, which predominantly serve the needs of the modern user community—with occasional extensions for historic forms—the encoding of the Runic script attempts to suit the needs of texts from different periods of time and from distinct societies that had little contact with each other. Direction. Like other early writing systems, runes could be written either from left to right or from right to left, or moving first in one direction, then the other (boustrophedon), or following the outlines of the inscribed object. At times, characters appear in mirror image, or upside down, or both. In modern scholarly literature, Runic is written left to right. Therefore, the letters of the Runic script have a default directionality of strong left-to-right in this standard. The Runic Alphabet. Present-day knowledge about runes is incomplete. The set of graphe- mically distinct units shows greater variation in its graphical shapes than most modern scripts. The Runic alphabet changed several times during its history, both in the number and shapes of the letters contained in it. The shape of most runes can be related to some Latin capital letter, but not necessarily with a letter representing the same sound. The most conspicuous difference between the Latin and the Runic alphabets is the order of the let- ters. The Runic alphabet is known as the futhark from the name of its first six letters. The origi- nal old futhark contained 24 runes: ‡‹ –— œž ¡£¦ ¨¬­®¯ ´·»¼¿ÁÃÄ They are usually transliterated in this way: f u ã arkgw hnij Ô pz s t be ml ° do

In England and Friesland, seven additional runes were added from the fifth century to the ninth century. In the Scandinavian countries, the futhark changed in a different way; in the eighth century, the simplified appeared. It consists of only 16 runes, some of which are used in two different forms. The long-branch form is shown here: ‡‹‘–™ ¡£¦ ª° ´·½¿ Ë f u ã or k hni as t bml þ

174 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.6 Runic

The use of runes continued in Scandinavia during the Middle Ages. During that time, the futhark was influenced by the Latin alphabet and new runes were invented so that there was full correspondence with the Latin letters. Representative Glyphs. The known inscriptions can include considerable variations of shape for a given rune, sometimes to the point where the nonspecialist will mistake the shape for a different rune. There is no dominant main form for some runes, particularly for many runes added in the Anglo-Friesian and medieval Nordic systems. When transcribing a Runic inscription into its Unicode-encoded form, one cannot rely on the idealized refer- ence glyph shape in the character charts alone. One must take into account to which of the four Runic systems an inscription belongs, and be knowledgeable about the permitted form variations within each system. The reference glyphs were chosen to provide an image that distinguishes each rune visually from all other runes in the same system. For actual use, it might be advisable to use a separate font for each Runic system. Of particular note is the fact that the glyph for U+16C4 ©    is actually a rare form, as the more common form is already used for U+16E1 Æ   . Unifications. When a rune in an earlier writing system evolved into several different runes in a later system, the unification of the earlier rune with one of the later runes was based on similarity in graphic form rather than similarity in sound value. In cases where a substan- tial change in the typical graphical form has occurred, though the historical continuity is undisputed, unification has not been attempted. When runes from different writing sys- tems have the same graphic form but a different origin and denote different sounds, they have been coded as separate characters. Long-Branch and Short-Twig. Two sharply different graphic forms, the long-branch and the short-twig form, were used for nine of the 16 Viking Age Nordic runes. Although only one form is used in a given inscription, there are runologically important exceptions. In some cases, the two forms were used to convey different meanings in later use in the medi- eval system. Therefore the two forms have been separated in the Unicode Standard. Staveless Runes. Staveless runes are a third form of the Viking Age Nordic runes, a kind of runic . The number of known inscriptions is small and the graphic forms of many of the runes show great variability between inscriptions. For this reason, staveless runes have been unified with the corresponding Viking Age Nordic runes. The correspond- ing Viking Age Nordic runes must be used to encode these characters, specifically the short-twig characters, where both short-twig and long-branch characters exist. Punctuation Marks. The wide variety of Runic punctuation marks has been reduced to three distinct characters based on simple aspects of their graphical form, as very little is known about any difference in intended meaning between marks that look different. Any other punctuation marks have been unified with shared punctuation marks elsewhere in the Unicode Standard. Golden Numbers. Runes were used as symbols for Sunday letters and golden numbers on calendar staves used in Scandinavia during the Middle Ages. To complete the number series 1–19, three additional calendar runes were added. They are included after the punctuation marks. Encoding. A total of 81 characters of the Runic script are included in the Unicode Standard. Of these, 75 are Runic letters, 3 are punctuation marks, and 3 are Runic symbols. The order of the Runic characters follows the traditional futhark order, with variants and derived runes inserted directly after the corresponding ancestor. Runic character names are based as much as possible on the sometimes several traditional names for each rune, often with the Latin transliteration at the end of the name.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 175 7.7 Ogham European Alphabetic Scripts

7.7 Ogham

Ogham: U+1680–U+169F Ogham is an alphabetic script devised to write a a very early form of Irish. Monumental Ogham inscriptions are found in Ireland, Wales, Scotland, England, and on the Isle of Man. Many of the Scottish inscriptions are undeciphered and may be in Pictish. It is probable that Ogham (Old Irish “Ogam”) was widely written in wood in early times. The main flow- ering of “classical” Ogham, rendered in monumental stone, was in the fifth and sixth cen- turies. Such inscriptions were mainly employed as territorial markers and memorials; the more ancient examples are standing stones. The script was originally written along the edges of stone where two faces meet; when writ- ten on paper, the central “stemlines” of the script can be said to represent the edge of the stone. Inscriptions written on stemlines cut into the face of the stone, instead of along its edge, are known as “scholastic” and are of a later date (post-seventh century). Notes were also commonly written in Ogham in as recently as the sixteenth century. Structure. The Ogham alphabet consists of 26 distinct characters (feda), the first 20 of which are considered to be primary, the last 6 (forfeda) supplementary. The four primary series are called aicmí (plural of aicme, meaning “family”). Each aicme was named after its first character, (Aicme Beithe, Aicme Uatha, meaning “the B Family,” “the H Family,” and so forth). The character names used in this standard reflect the spelling of the names in modern Irish Gaelic, except that the acute accent is stripped from Úr, Éabhadh, Ór, and Ifín, and the mutation of nGéadal is not reflected. Rendering. Ogham text is read beginning from the bottom left side of a stone, continuing upward, across the top and down the right side (in the case of long inscriptions). Monu- mental Ogham was incised chiefly in a bottom-to-top direction, though there are examples of left-to-right bilingual inscriptions in Irish and Latin. Ogham accommo- dated the horizontal left-to-right direction of the Latin script, and the vowels were written as vertical strokes as opposed to the incised notches of the inscriptions. Ogham should therefore be rendered on computers from left-to-right or from bottom-to-top (never start- ing from top-to-bottom). Forfeda (Supplementary Characters). In printed and in manuscript Ogham, the fonts are conventionally designed with a central stemline, but this convention is not necessary. In implementations without the stemline, the character U+1680    should be given its conventional width and simply left blank like U+0020 . U+169B    and U+169C     are used at the beginning and the end of Ogham text, particularly in manuscript Ogham. In some cases, only the Ogham feather mark is used, which can indicate the direction of the text.

176 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.8 Modifier Letters

7.8 Modifier Letters

Spacing Modifier Letters: U+02B0–U+02FF Modifier letters are an assorted collection of small signs that are generally used to indicate modifications of a preceding letter. A few may modify the following letter, and some may serve as independent letters. These signs are distinguished from diacritical marks in that modifier letters are treated as free-standing, spacing characters. They are distinguished from similar or identical appearing punctuation or symbols by the fact that the members of this block are considered to be letter characters that do not break up a word. They have the “letter” character property (see Chapter 4, Character Properties). The majority of these signs are phonetic modifiers, including the characters required for coverage of the Interna- tional Phonetic Alphabet (IPA). Phonetic Usage. Modifier letters have relatively well-defined phonetic interpretations. Their usage generally indicates a specific articulatory modification of a sound represented by another letter or intended to convey a particular level of stress or tone. In phonetic usage, the modifier letters are sometimes called “diacritics,” which is correct in the logical sense that they are modifiers of the preceding letter. However, in the Unicode Standard, the term “diacritical marks” refers specifically to nonspacing marks, whereas the codes in this block specify spacing characters. For this reason, many of the modifier letters in this block correspond to separate diacritical mark codes, which are cross-referenced in Section 14.1, Character Names List. Encoding Principles. This block includes characters that may have different semantic val- ues attributed to them in different contexts. It also includes multiple characters that may represent the same semantic values—there is no necessary one-to-one relationship. The intention of the Unicode encoding is not to resolve the variations in usage, but merely to supply implementers with a set of useful forms from which to choose. The list of usages given for each modifier letter should not be considered exhaustive. For example, the glottal stop (Arabic hamza) in Latin transliteration has been variously represented by the charac- ters U+02BC   , U+02BE     , and U+02C0    . Conversely, an apostrophe can have several uses; for a list, see the entry for U+02BC    in the character names list. There are also instances where an IPA modifier letter is explicitly equated in semantic value to an IPA nonspacing diacritic form. Latin Superscripts. Graphically, some of the phonetic modifier signs are raised or super- scripted, some are lowered or subscripted, and some are vertically centered. Only those few forms that have specific usage in IPA or other major phonetic systems are encoded. Spacing Clones of Diacritics. Some corporate standards explicitly specify spacing and nonspacing forms of combining diacritical marks, and the Unicode Standard provides matching codes for these interpretations when practical. A number of the spacing forms are covered in the Basic Latin and Latin-1 Supplement blocks. The six common European diacritics that do not have encodings there are added as spacing characters. These forms can have multiple semantics, such as U+02D9  , which is used as an indicator of the Mandarin Chinese fifth tone. Rhotic Hook. U+02DE     is defined in IPA as a free-standing modifier letter. However, in common usage, it is treated as a ligated hook on a baseform let- ter. Hence, U+0259     + U+02DE     may be treated as equivalent to U+025A      .

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 177 7.8 Modifier Letters European Alphabetic Scripts

Tone Letters. U+02E5..U+02E9 comprise a set of basic tone letters, defined in IPA and commonly used in detailed tone transcriptions of African and other languages. Each refers to one of five distinguishable tone levels. To represent contour tones, the tone letters are used in combinations. The rendering of contour tones follows a regular set of ligation rules that results in a graphic image of the contour (see Figure 7-4). Figure 7-4. Tone Letters +=

178 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.9 Combining Marks

7.9 Combining Marks

Combining Diacritical Marks: U+0300–U+036F The combining diacritical marks in this block are intended for general use with any script. Diacritical marks specific to a particular script are encoded with the alphabet for that script. Diacritical marks that are primarily used with symbols are defined in the Combin- ing Diacritical Marks for Symbols character block (U+20D0..U+20FF). Standards. The combining diacritical marks are derived from a variety of sources, includ- ing IPA, ISO 5426, and ISO 6937. Sequence of Base Letters and Diacritics. In the Unicode character encoding, all nonspac- ing marks, including diacritics, are encoded after the base character. For example, the Uni- code character sequence U+0061 “a”    , U+0308 “þ”í  , U+0075 “u”     unambiguously encodes “äu”, not “aü”. The Unicode Standard convention is consistent with the logical order of other nonspacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. This convention is also in line with the way mod- ern font technology handles the rendering of nonspacing glyphic forms, so that mapping from character memory representation to rendered glyphs is simplified. (For more infor- mation on the use of diacritical marks, see Chapter 2, General Structure, and Chapter 3, Conformance.) Diacritics Positioned Over Two Base Characters. IPA and a few languages such as Tagalog use diacritics that are applied to two baseform characters. These marks apply to the previ- ous base character—just like all other combining nonspacing marks—but hang over the following letter as well. The two characters U+0360    and U+0361     are intended to be displayed as depicted in Figure 7-5. Figure 7-5. Double Diacritics ~ ~ o + @ ➠ o ~ ~ o + @ + o ➠ oo

These double diacritics always bind more loosely than all other nonspacing marks except U+0345 iota subscript, and thus sort near the end in the canonical representation. In ren- dering, the double diacritic will float above other diacritics (excluding surrounding diacrit- ics), as in Figure 7-6. Figure 7-6. Ordering of Double Diacritics ~ ~ o + @ˆ + @ + o + @¨ ➠ ôö ~ ~ o + @@ + @ˆ + o + ¨ ➠ ôö

Marks as Spacing Characters. By convention, combining marks may be exhibited in (apparent) isolation by applying them to U+0020  or to U+00A0 - .

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 179 7.9 Combining Marks European Alphabetic Scripts

This approach might be taken, for example, when referring to the diacritical mark itself as a mark, rather than using it in its normal way in text. The use of U+0020  versus U+00A0 -  affects line-breaking behavior. In charts and illustrations in this standard, the combining nature of these marks is illus- trated by applying them to U+25CC  , as shown in the examples throughout this standard. The Unicode Standard separately encodes clones of many common European diacritical marks as spacing characters. These related characters are cross-referenced in the character names list. Encoding Principles. Because nonspacing marks have such a wide variety of applications, the characters in this block may have multiple semantic values. For example, U+0308 = diaeresis = umlaut = double derivative. There are also cases of several different Unicode characters for equivalent semantic values; variants of  include at least U+0312    , U+0326   , and U+0327  . (For more information about the difference between nonspacing marks and combining characters, see Chapter 2, General Structure.) Glyphic Variation. When rendered in the context of a language or script, like let- ters, combining marks may be subjected to systematic stylistic variation. For example, when used in Polish, U+0301    appears at a steeper angle than when it is used in French. When it is used for Greek (as oxia), it can appear nearly upright. U+030C   is commonly rendered as an apostrophe when used with cer- tain letterforms. U+0326    is sometimes rendered as U+0312     on a lowercase “g” to avoid conflict with the decender. In many fonts, there is no clear distinction made between    and U+0327  . Combining accents above the base glyph are usually adjusted in height for use with upper- case versus lowercase forms. In the absence of specific font protocols, combining marks are often designed as if they were applied to typical base characters in the same font. For more information, see Section 5.14, Rendering Nonspacing Marks.

Combining Marks for Symbols: U+20D0–U+20FF Diacritical marks for symbols are generally applied to mathematical or technical symbols. They can be used to extend the range of the symbol set. For example, U+20D3      can be used to express negation. Its presentation may change in those circumstances, changing length or slant. That is, U+2261   followed by U+20D3 is equivalent to U+2262   . In this case, there is a pre- composed form for the negated symbol. However, this statement does not always hold true, and U+20D3 can be used with other symbols to form the negation. For example, U+2258   followed by U+20D3 can be used to express does not correspond to, with- out requiring that a precomposed form be part of the Unicode Standard. Other nonspacing characters can also be used in mathematical expressions, of course. For example, a U+0304   is commonly used in propositional logic to indi- cate logical negation. Enclosing Diacritics. These nonspacing characters are supplied for compatibility with existing standards, allowing individual base characters to be enclosed in several ways. For example, U+2460    Å can be expressed as U+0030   “1” + U+20DD    O. As with other combining characters, this one can also be applied productively; circled letter alef can be produced by the sequence:

180 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.9 Combining Marks

U+05D0    µ + U+20DD    O. The com- bining enclosing diacritics cannot be used to enclose a sequence of base characters in plain text. For example, there is no way to represent U+246A    with the  , as there is no single character  .

Combining Half Marks: U+FE20–U+FE2F This block consists of a number of presentation form (glyph) encodings that may be used to visually encode certain combining marks that apply to multiple base letterforms. These characters are intended to facilitate the support of such combining marks in legacy imple- mentations. Unlike the other compatibility characters, these characters do not correspond to a single nominal character or a sequence of nominal characters; rather, a discontiguous sequence of these corresponds to a single combining mark, as depicted in Figure 7-7. The preferred forms are the double diacritics, U+0360 and U+0361. Figure 7-7. Combining Half Marks

Combining Half Marks~ ~ ~ ng+ þ + + þ → ng

U+006E U+FE22 U+0067 U+FE23

Single Combining Mark~ ~ n + þ + g → ng

U+006E U+0360 U+0067

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 181 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 8 Middle Eastern Scripts 8

The scripts in this category have a common origin in the ancient Phoenician alphabet. They include: •Hebrew •Arabic •Syriac • Thaana The Hebrew script is used in Israel and for languages of the Diaspora. The Arabic script is used to write many languages throughout the Middle East, North Africa, and certain parts of Asia. The Syriac script is used to write a number of Middle Eastern languages. These three scripts also function as major liturgical scripts, and therefore are used worldwide by various religious groups. The Thaana script is used to write Dhivehi, the language of the Republic of Maldives, an island nation in the middle of the Indian Ocean. The Middle Eastern scripts are all alphabetic, with small character sets. Words are demar- cated by spaces. Except for Thaana, these scripts include a number of distinctive punctua- tion marks. In addition, the Arabic script includes traditional forms for digits, called “Arabic-Indic digits” in the Unicode Standard. Text in these scripts is written from right to left. Implementations of these scripts must conform to the Unicode bidirectional algorithm (see Section 3.12, Bidirectional Behavior). Arabic and Syriac are scripts even when typeset, unlike Hebrew and Thaana, where letters are unconnected. Most letters in Arabic and Syriac assume different forms depend- ing on their position in a word. Shaping rules for the rendering of text are specified in the block descriptions for Arabic and Syriac. Shaping rules are not required for Hebrew because only five letters have position-dependent final forms, and these forms are sepa- rately encoded. Historically, Middle Eastern scripts did not include short vowels. Nowadays, short vowels are represented by marks positioned above or below a consonantal letter. Vowels and other marks of pronunciation (“vocalization”) are encoded as combining characters, so support for vocalized text necessitates use of composed character sequences. Syriac, Thaana, and are normally written with vocalization; Hebrew and Arabic are usually written unvocalized.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 185 8.1 Hebrew Middle Eastern Scripts

8.1 Hebrew

Hebrew: U+0590–U+05FF The Hebrew script is used for writing the as well as Yiddish, Judezmo (Ladino), and a number of other languages. Vowels and various other marks are written as points, which are applied to consonantal base letters; these marks are usually omitted in Hebrew, except for liturgical texts and other special applications. Five Hebrew letters assume a different graphic form when last in a word. Directionality. The Hebrew script is written from right to left. Conformant implementa- tions of Hebrew script must use the Unicode bidirectional algorithm (see Section 3.12, Bidi- rectional Behavior.) Cursive. The Unicode Standard uses the term cursive to refer to writing where the letters of a word are connected. A handwritten form of Hebrew is known as cursive, but its rounded letters are generally unconnected, so the Unicode definition does not apply. Fonts based on exist: they are used not only to show examples of Hebrew handwriting, but also for display purposes. Standards. ISO 8859-8—Part 8. Latin/Hebrew Alphabet. The Unicode Standard encodes the Hebrew alphabetic characters in the same relative positions as in ISO 8859-8; however, there are no points or characters in the ISO standard. Vowels and Other Marks of Pronunciation. These combining marks, generically called points in the context of Hebrew, indicate vowels or other modifications of consonantal let- ters. General rules for applying combining marks are given in Chapter 2.6, Combining Characters, and Section 3.10, Canonical Ordering Behavior. Additional Hebrew-specific behavior is described below. Hebrew points can be separated into four classes: dagesh, shin dot and sin dot, vowels, and other marks of punctuation. Dagesh, U+05BC   , has the form of a dot that appears inside the let- ter that it affects. It is not a vowel, but a diacritic that affects the pronunciation of a conso- nant. The same base consonant can also have a vowel and/or other diacritics. Dagesh is the only element that goes inside a letter. The dotted Hebrew consonant shin is explicitly encoded as the sequence U+05E9,    followed by U+05C1,    . The shin dot is positioned on the upper-right side of the undotted base letter. Similarly, the dotted consonant sin is explicitly encoded as the sequence U+05E9,    followed by U+05C2,    . The sin dot is positioned on the upper-left side of the base letter. The two dots are mutually exclusive. The base letter shin can also have a dagesh, a vowel, and other diacritics. Use of the two dots with any other base character is an error. Vowels all appear below the base character that they affect, except for holam, U+05B9   , which appears above left. The following points represent vowels: U+05B0–U+05B9, U+05BB. The remaining three points are marks of pronunciation: U+05BD   , U+05BF   , and U+FB1E   - . Meteg, also known as siluq, goes below the base character; rafe and varika go above. The varika, used in Judezmo, is a glyphic variant of rafe.

186 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.1 Hebrew

Shin and Sin. Separate characters for the dotted letters shin and sin are not included in this block. When it is necessary to distinguish between the two forms, they should be encoded as U+05E9    followed by the appropriate dot (U+05C1 or U+05C2). This practice is consistent with Israeli standard encoding. Final (Contextual Variant) Letterforms. Variant forms of five Hebrew letters are encoded as separate characters in this block, as in Hebrew standards including ISO 8859-8. These variant forms are generally used in place of the nominal letter forms at the end of words. Certain words, however, are spelled with nominal rather than final forms, particularly names and foreign borrowings in Hebrew, and some words in Yiddish. Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly from the codeset and rendered one-to-one via their own glyphs—that is, without contextual analysis. Yiddish Digraphs. The digraphs are considered to be independent characters in Yiddish. The Unicode Standard has included them as separate characters so as to distinguish certain letter combinations in Yiddish text—for example, to distinguish the digraph double vav from an occurrence of a consonantal vav followed by a vocalic vav. The use of digraphs is consistent with standard . Other letters of the Yiddish alphabet, such as pasekh alef, can be composed from other characters, although alphabetic presentation forms are also encoded. Punctuation. Most punctuation marks used with the Hebrew script are not given indepen- dent codes (that is, they are unified with Latin punctuation), except for the few cases where the mark has a unique form in Hebrew—namely, U+05BE   , U+05C0    (also known as legarmeh), U+05C3  -   , U+05F3   , and U+05F4  -  . Note that for paired punctuation such as parentheses, the glyphs chosen to represent U+0028   and U+0029   will depend upon the direction of the rendered text. See Section 4.7, Mirrored—Normative, for more information. Cantillation Marks. Cantillation marks are used in publishing liturgical texts, including the Bible. There are various historical schools of cantillation marking; the set of marks included in the Unicode Standard follows the Israeli standard SI 1311.2. Positioning. Marks may combine with vowels and other points, and there are complex typographic rules for positioning these combinations. The vertical placement (meaning above, below, or inside) of points and marks is very well defined. The horizontal placement (meaning left, right, or center) of points is also very well defined. The horizontal placement of marks, on the other hand, is not well defined, and convention allows for the different placement of marks relative to their base character. When points and marks are located below the same base letter, the point always comes first (on the right) and the mark after it (on the left), except for the marks yetiv, U+059A   , and dehi, U+05AD   , which come first (on the right) and are followed (on the left) by the point. These rules are followed when points and marks are located above the same base letter: • If the point is holam, all cantillation marks precede it (on the right), except pashta, U+0599   . • Pashta always follows (goes to the left of) points.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 187 8.1 Hebrew Middle Eastern Scripts

• Holam on a sin consonant (shin base + sin dot) follows (goes to the left of) the sin dot. However, the two combining marks are sometimes rendered as a single assimilated dot. • Shin dot and sin dot are generally represented closer vertically to the base letter than other points and marks that go above it.

Alphabetic Presentation Forms: U+FB1D–U+FB4F The Hebrew characters in this block are chiefly of two types: variants of letters and marks encoded in the main Hebrew block, and precomposed combinations of a Hebrew letter or digraph with one or more vowels or pronunciation marks. This block contains all of the vocalized letters of the Yiddish alphabet. The alef lamed ligature and a Hebrew variant of the plus sign are also included. The Hebrew plus sign variant, U+FB29     , is more often used in handwriting than in print, but does occur in school textbooks. It is used by those who wish to avoid cross symbols, which can have reli- gious and historical connotations. Use of Wide Letters. Wide letterforms are used in handwriting and in print to achieve even margins. The wide-form letters in the Unicode Standard are those that are most commonly “stretched” in justification. If Hebrew text is to be rendered with even margins, justification should be left to the text formatting software. These alphabetic presentation forms are included for compatibility purposes. For the pre- ferred encoding, see Hebrew Presentation Forms, U+FB1D–U+FB4F, in the names list. For letterlike symbols, see U+2135..U+2138. For additional punctuation to be used with this script, see Section 6.1, General Punctuation. The    (U+20AA) is encoded in the currency block.

188 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic

8.2 Arabic

Arabic: U+0600–U+06FF The Arabic script is used for writing the Arabic language and has been extended for repre- senting a number of other languages, such as Persian, Urdu, Pashto, Sindhi, and Kurdish. Urdu is often written with the ornate Nastaliq script variety. Some languages, such as Indo- nesian/Malay, Turkish, and Ingush, formerly used the Arabic script and now employ the Latin or Cyrillic scripts. The Arabic script is cursive, even in its printed form (see Figure 8-1). As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vow- els and various other marks may be written as combining marks called harakat, which are applied to consonantal base letters. In normal writing, however, these harakat are omitted. Directionality. The Arabic script is written from right to left. Conformant implementa- tions of Arabic script must use the Unicode bidirectional algorithm (see Section 3.12, Bidi- rectional Behavior.) Figure 8-1. Directionality and Cursive Connection

Standards. ISO 8859-6—Part 6. Latin/. The Unicode Standard encodes the basic Arabic characters in the same relative positions as in ISO 8859-6. ISO 8859-6, in turn, is based on ECMA-114, which was based on ASMO 449. Encoding Principles. The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different con- textual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it. Punctuation. Most punctuation marks used with the Arabic script are not given indepen- dent codes (that is, they are unified with Latin punctuation), except for the few cases where the mark has a significantly different appearance in Arabic—namely, U+060C  , U+061B  , U+061F   , and U+066A   . For paired punctuation such as parentheses, the glyphs chosen to represent U+0028   and U+0029   will depend upon the direction of the rendered text. The Non-joiner and the Joiner. The Unicode Standard provides two user-selectable for- matting codes: U+200C   - and U+200D    (see Figure 8-2, Figure 8-3, and Figure 8-4). The use of a non-joiner between two letters prevents them from forming a cursive connection with each other when rendered. Examples include

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 189 8.2 Arabic Middle Eastern Scripts the Persian plural suffix, some Persian proper names, and vowels. For further discussion of joiners and non-joiners, see Section 13.2, Layout Controls. Figure 8-2. Using Joiner

Figure 8-3. Using Non-joiner

Figure 8-4. Combinations of Joiner and Non-joiner

Harakat (Vowel) Nonspacing Marks. Harakat are marks that indicate vowels or other modifications of consonant letters. The occurrence of a character in the harakat range and its depiction in relation to a dashed circle constitute an assertion that this character is intended to be applied via some process to the character that precedes it in the text stream, the base character. General rules for applying nonspacing marks are given in the Combin- ing Diacritical Marks block description section. The few marks that are placed after (to the left of) the base character are treated as ordinary spacing characters in the Unicode Stan- dard. The Unicode Standard does not specify a sequence order in case of multiple harakat applied to the same Arabic base character as there is no possible ambiguity of interpreta- tion. (For more information about the canonical ordering of nonspacing marks, see Chap- ter 2, General Structure, and Chapter 3, Conformance.) Arabic-Indic Digits. The names for the forms of decimal digits vary widely across different languages. The decimal numbering system originated in India (Devanagari vwx…) and was subsequently adopted in the Arabic world with a different appearance (Arabic •–—˜…). The Europeans adopted decimal numbers from the Arabic world, although once again the forms of the digits changed greatly (European 0123…). The European forms were later adopted widely around the world and are used even in many Arabic- speaking countries in North Africa. In each case, the interpretation of decimal numbers remained the same. However, the forms of the digits changed to such a degree that they are no longer recognizably the same characters. Because of the origin of these characters, the European decimal numbers are widely known as “” or “Hindi-Arabic

190 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic numerals,” whereas the decimal numbers in use in the Arabic world are widely known there as “Hindi numbers.” The Unicode Standard includes both Indic digits (including forms used with different Indic scripts), Arabic digits (with forms used in most of the Arabic world), and European digits (now used internationally). Because of this decision, the traditional names could not be retained without confusion. In addition, there are two main variants of the Arabic digits— those used in Iran and Pakistan (here called Eastern Arabic-Indic) and those used in other parts of the Arabic world. The Persian and Urdu variant digits are given separate codes in the Unicode Standard to account for the differences in appearance and directional treatment when rendering them. The Unicode Standard provides one sequence of digits for both Persian and Urdu. Their presentation depends on the font. (For a complete discussion of directional formatting in the Unicode Standard, see Section 3.12, Bidirectional Behavior.) In summary, the Unicode Standard uses the names shown in Table 8-1 . These names have been chosen to reduce the confusion involved in the use of the decimal . They do not have any normative content; as with the choice of any other names, they are meant to be unique distinguishing labels and should not be viewed as favoring one culture over another. Table 8-1. Digit Names

Name Code Points Forms European U+0030..U+0039 0123456789 Arabic-Indic U+0660..U+0669 •–—˜™š›œž Eastern Arabic-Indic U+06F0..U+06F9 ÕÖ×ØÙÚÛÜÝÞ Indic (Devanagari) U+0966..U+096F vwxyz{|}~

Extended Arabic Letters. Arabic script is used to write major languages, such as Persian and Urdu, but it has also been used to transcribe some relatively obscure languages, such as Baluchi and Lahnda, which have little tradition in printed typography. As a result, the set of characters encoded in this section unavoidably contains spurious forms. The Unicode Standard encodes multiple forms of the Extended Arabic letters and variant digits because the character forms and usages are not well documented for a number of languages. This approach was felt to be the most practical in the interest of minimizing the risk of omitting valid characters. Koranic Annotation Signs. These characters are used in the Koran to mark pronunciation and other annotation. The two enclosing marks, U+06DD and U+06DE, are used to enclose digits. When rendered, these digits appear in a smaller size. Languages. The languages using a given character are occasionally indicated, even though this information is incomplete. When such an annotation ends with an ellipsis (…), then the languages cited are merely the known principal ones among many. Minimum Rendering Requirements. The cursive nature of the Arabic script imposes spe- cial requirements on display or rendering processes that are not typically found in Latin script-based systems. A display process must convert between the logical order in which Arabic characters are placed in backing store and the visual (or physical) order required by the display device. (See Section 3.12, Bidirectional Behavior, for a description of the conver- sion between logical and visual orders.) At a minimum, a display process must also select an appropriate glyph to depict each Ara- bic letter according to its immediate joining context; furthermore, it must substitute certain

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 191 8.2 Arabic Middle Eastern Scripts ligature glyphs for sequences of Arabic characters. The remainder of this section specifies a minimum set of rules that provide legible Arabic joining and ligature substitution behavior.

Cursive Joining Joining Classes. Each Arabic letter must be depicted by one of a number of possible con- textual glyph forms. The appropriate form is determined on the basis of its joining class and the joining class of adjacent characters. Each Arabic character falls into one of the classes shown in Table 8-2 (see ArabicShaping.txt on the CD-ROM for a complete list). Table 8-3 defines derived superclasses of the primary Arabic joining classes; those super- classes are used in the cursive joining rules. In both tables, right and left refer to visual order. Table 8-2. Primary Arabic Joining Classes

Joining Class Symbols Members Right-joining R , , , ,  Left-joining L None Dual-joining D , , ,  … Join-causing C   ,  () These characters are distinguished from the dual-joining characters in that they do not change shape. Non-joining U   - and all spacing characters (other than the above), including ,  , spaces, digits, punctuation, non-Arabic letters, and so on Transparent T All combining marks and format marks, including - , , , , , , ,  , - , and so on Table 8-3. Derived Arabic Joining Classes

Joining Class Members Right join-causing Superset of dual-joining, left-joining, and join-causing Left join-causing Superset of dual-joining, right-joining, and join-causing

Joining Rules. The following rules describe the joining behavior of Arabic letters in terms of their display (visual) order. In other words, the positions of letterforms in the included examples are presented as they would appear on the screen after the bidirectional algorithm has reordered the characters of a line of text. • An implementation may choose to restate the following rules according to logi- cal order so as to apply before the bidirectional algorithm’s reordering phase. In this case, the words right and left as used in this section would become preceding and following. In the following rules, if X refers to a character, then various glyph types representing that character are referred to as shown in Table 8-4. Table 8-4. Arabic Glyph Types

Glyph Types Description

Xn Nominal glyph form as it appears in the code charts Xr Right-joining glyph form (both right-joining and dual-joining characters may employ this form)

192 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic

Table 8-4. Arabic Glyph Types (Continued)

Glyph Types Description

Xl Left-joining glyph form (both left-joining and dual-joining characters may employ this form) Xm Dual-joining (medial) glyph form that joins on both left and right (only dual- joining characters employ this form)

R1 Transparent characters do not affect the joining behavior of base (spacing) charac- ters. For example: MEEM.N + SHADDAH.N + LAM.N Å MEEM.R + SHADDAH.N + LAM.L

R2 A right-joining character X that has a right join-causing character on the right will adopt the form Xr . For example: ALEF.N + TATWEEL.N Å ALEF.R + TATWEEL.N

R3 A left-joining character X that has a left join-causing character on the left will adopt the form Xl. R4 A dual-joining character X that has a right join-causing character on the right and a left join-causing character on the left will adopt the form Xm. For example: TATWEEL.N + MEEM.N + TATWEEL.N Å TATWEEL.N + MEEM.M + TATWEEL.N

R5 A dual-joining character X that has a right join-causing character on the right and no left join-causing character on the left will adopt the form Xr. For example: MEEM.N + TATWEEL.N Å MEEM.R + TATWEEL.N

R6 A dual-joining character X that has a left join-causing character on the left and no right join-causing character on the right will adopt the form Xl. For example: TATWEEL.N + MEEM.N Å TATWEEL.N + MEEM.L

R7 If none of the above rules applies to a character X, then it will adopt the nominal form Xn. As just noted, the   - may be used to prevent joining, as in the Per- sian (Farsi) plural suffix or Ottoman Turkish vowels.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 193 8.2 Arabic Middle Eastern Scripts

Ligatures Ligature Classes. Certain types of ligatures are obligatory in Arabic script regardless of font design. Many other optional ligatures are possible, depending on font design. Because they are optional, those ligatures are not covered. For the purpose of describing the obligatory Arabic ligatures, certain Unicode characters fall into the following classes (see the end of this block description for a complete list): Alef-types: --,    … Lam-types: ,    ,     … These two classes are designated below as ALEF and LAM, respectively. Ligature Rules. The following rules describe the formation of ligatures. They are applied after the preceding joining rules. Like the joining rules just discussed, the following rules describe ligature behavior of Arabic letters in terms of their display (visual) order. In the following rules, if X and Y refer to characters, then various glyph types representing combinations of these characters are referred to as shown in Table 8-5. Table 8-5. Ligature Notation

Symbol Description

(X.Y)n Nominal ligature glyph form representing a combination of an Xr form and a Yl form (X.Y)r Right-joining ligature glyph form representing a combination of an Xr form and a Ym form (X.Y)l Left-joining ligature glyph form representing a combination of an Xm form and a Yl form (X.Y)m Dual-joining (medial) ligature glyph form representing a combination of an Xm form and a Ym form

L1 Transparent characters do not affect the ligating behavior of base (nontranspar- ent) characters. For example: ALEF.R + FATHAH.N + LAM.L Å LAM-ALEF.N + FATHAH.N

L2 Any sequence with ALEFr on the left and LAMm on the right will form the ligature (ALEF.LAM)l. For example:

L3 Any sequence with ALEFr on the left and LAMl on the right will form the ligature (ALEF.LAM)n. For example:

• From the perspective of logical (or reading) order, the preceding (ALEF.LAM) ligature forms are referred to as - forms. The difference reflects the use of visual order rather than logical order to state ligature rules. Optional Features. Many other ligatures and contextual forms are optional—depending on the font and application. Some of these presentation forms are encoded in the ranges FB50..FDFB and FE70..FEFE. However, these forms should not be used in general inter- change. Moreover, it is not expected that every Arabic font will contain all of these forms, nor that these forms will include all presentation forms used by every font.

194 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic

More sophisticated rendering systems will use additional shaping and placement. For example, contextual placement of the nonspacing vowels such as fatha will provide better appearance. The justification of Arabic tends to stretch words instead of adding width to spaces. Basic stretching can be done by inserting tatweel between characters shaped by rules R2, R4..R6, L2, and L3; the best places for inserting tatweel will depend on the font and rendering software. More powerful systems will choose different shapes for characters such as kaf to fill the space in justification. Arabic Character Joining Types. Table 8-6, Table 8-7, and Table 8-8 provide a detailed list of the Arabic characters that are either right-joining or dual-joining. All other Arabic char- acters (aside from ) are non-joining, including U+06D5   . For brevity in the names, the words , , and  are abbreviated using digits. Most of the extended Arabic characters are merely variations on the basic Arabic shapes, with additional or different marks. For compatibility, many precomposed forms are included. In some cases, characters occur only at the end of words in correct spelling; they are called trailing characters. Examples include  ,  , and . When trailing characters are joining (such as  ), they are classed as right-join- ing, even when similarly shaped characters are dual-joining. When characters do not join or cause joining (such as ), they are classified as transparent.  • In the case of U+0647 , the glyph HEHl is shown in the code charts. This form is often used to reduce the chance of misidentifying  as U+0665 -    , which has a very similar shape. The nominal form of  is the isolate form, which looks like U+06D5   . In the case of U+06C1  , the nominal form is not even listed; it also resembles -   . The characters in these tables are grouped by shape and not by standard Arabic alphabeti- cal order. For a machine readable version of the information in these tables, see Arabic- Shaping.txt on the CD-ROM. Table 8-6. Dual-Joining Arabic Characters

Other Characters with Similar Group X X X X n r m l Shaping Behavior All letters based on the BEH form, includ- BEH ‚ƒò ñing TEH, THEH, and diacritic variants of these. 0628, 062A, 062B, 0679..0680. All letters based on the NOON form. 0646, NOON ÊËÍ Ì 06B9..06BD. All letters based on the YEH form, includ- YEH Ö× Ù Øing ALEF MAKSURA. 0626, 0649, 064A, 0678, 06CC, 06CE, 06D0, 06D1. All letters based on the HAH form, includ- HAH †‡‰ˆing KHAH, JEEM, and diacritic variants of these. 062C..062E, 0681..0687, 06BF. All letters based on the SEEN form, includ- SEEN – — ™˜ing SHEEN and diacritic variants of these. 0633, 0634, 069A..069C, 06FA. All letters based on the SAD form, includ- SAD žŸ¡ ing DAD and diacritic variants of these. 0635, 0636, 069D, 069E, 06FB.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 195 8.2 Arabic Middle Eastern Scripts

Table 8-6. Dual-Joining Arabic Characters (Continued)

Other Characters with Similar Group X X X X n r m l Shaping Behavior All letters based on the TAH form, includ- TAH ¦§©¨ing ZAH and diacritic variants of these. 0637, 0638, 069F. All letters based on the AIN form, includ- AIN ®¯±°ing GHAIN and diacritic variants of these. 0639, 063A, 06A0, 06FC. All letters based on the FEH form. 0641, FEH ¶· ¹ ¸ 06A1..06A6.

All letters based on the QAF form. 0642, QAF § »½¼ 06A7, 06A8.

MEEM ÆÇÉÈAll letters based on the MEEM form. 0645.

HEH ÎÏÑÐAll letters based on the HEH form. 0647.

KNOTTED ’‘All letters based on the KNOTTED HEH HEH form. 06BE. All letters based on the HEH GOAL form, HEH GOAL ‹ŒŽbut excluding HAMZA ON HEH GOAL. 06C1. All letters based on the KAF form. 0643, KAF ¾¿ÁÀ 06AC..06AE. SWASH >?@ AAll letters based on the SWASH KAF form. KAF 06AA.

All letters based on the form. 06A9, GAF òôù õ 06AB, 06AF..06B4.

All letters based on the LAM form. 0644, LAM ÂÃÅ Ä 06B5..06B8.

Table 8-7. Right-Joining Arabic Characters

Group Xn Xr Other Characters with Similar Shaping Behavior All letters based on the ALEF form. 0622, 0623, 0625, ALEF ê 0627, 0671, 0672, 0673, 0675.

All letters based on the form. 0624, 0648, 0676, WAW ÒÓ 0677, 06C4..06CB, 06CF.

All letters based on the DAL form, including THAL and DAL Ž diacritic variants of these. 062F, 0630, 0688..0690.

All letters based on the REH form, including ZAIN and REH ’“ diacritic variants of these. 0631, 0632, 0691..0699. All letters based on the HEH form that show a knotted TEH MARB- ôõform on right-joining, including HAMZA ON HEH. UTA 0629, 06C0.

196 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic

Table 8-7. Right-Joining Arabic Characters (Continued)

Group Xn Xr Other Characters with Similar Shaping Behavior All letters based on the HEH form that show a goal form HAMZA ON ¥§on right-joining (Urdu), including HAMZA ON HEH HEH GOAL GOAL. 06C2, 06C3. YEH WITH BCThe tailed form of YEH. 06CD. TAIL YEH “”All letters based on the YEH BARREE form (Urdu). 06D2, BARREE 06D3.

Table 8-8. Other Arabic Character Joining Classes

Class Members Join-causing ZERO WIDTH JOINER (200D) and TATWEEL (0640). ZERO WIDTH NON-JOINER (200C), and all spacing characters (other than those mentioned in Table 8-6 or Table 8-7) are non-joining, includ- Non-joining ing HAMZA (0621), HIGH HAMZA (0674), spaces, digits, punctuation, non-Arabic letters, and so on. All combining marks and other format controls are irrelevant, including FATHATAN (064B) and other points, MADDAH ABOVE (0653), Transparent HAMZA ABOVE (0654), HAMZA BELOW (0655), SUPERSCRIPT ALEF (0670), combining Koranic annotation signs, RIGHT-TO-LEFT MARK (200F), and so on.

Arabic Presentation Forms-A: U+FB50–U+FDFF This block contains a list of presentation forms (glyphs) encoded as characters for compat- ibility. At the time of publication, there are no known implementations of all of these pre- sentation forms. As with all other compatibility encodings, these characters have a preferred encoding that makes use of noncompatibility characters. The presentation forms in this block consist of contextual (positional) variants of Extended Arabic letters, contextual variants of Arabic letter ligatures, spacing forms of Arabic dia- critic combinations, contextual variants of certain Arabic letter/diacritic combinations, and Arabic phrase ligatures. The ligatures include a large set of presentation forms for com- patibility with existing systems. However, the set of ligatures appropriate for any given Ara- bic font will generally not match this set precisely. Fonts will often include only a subset of these glyphs, and they may also include glyphs outside of this set. These glyphs are gener- ally not accessible as characters and are used only by rendering engines. The alternative (ornate) forms of parentheses for use with the Arabic script are not com- patibility characters.

Arabic Presentation Forms-B: U+FE70–U+FEFF This block contains additional Arabic presentation forms consisting of spacing or tatweel forms of , contextual variants of primary Arabic letters, and the obligatory - ligature. They are included here for compatibility with preexisting standards and legacy implementations that use these forms as characters. They can be replaced by letters from the Arabic block (U+0600..U+06FF). Implementations can handle contextual glyph

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 197 8.2 Arabic Middle Eastern Scripts shaping by rendering rules when accessing glyphs from fonts, rather than by encoding con- textual shapes as characters. Spacing and Tatweel Forms of Arabic Diacritics. For compatibility with certain imple- mentations, a set of spacing forms of the Arabic diacritics is provided here. The tatweel forms are combinations of the joining connector tatweel and a diacritic. Zero Width No-Break Space. This character (U+FEFF), which is not an Arabic presenta- tion form, is described in the Specials block (U+FFF0..U+FFFF).

198 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac

8.3 Syriac

Syriac: U+0700–U+074F The belongs to the Aramaic branch of the family of Semitic languages. The earliest datable Syriac writing dates from the year 6 . Syriac is the active liturgical lan- guage of many communities in the Middle East (Syrian Orthodox, Assyrian, Maronite, Syr- ian Catholic, and Chaldaean) and Southeast India (Syro-Malabar and Syro-Malankara). It is also the native language of a considerable population in these communities. Syriac is divided into two dialects. West Syriac is used by the Syrian Orthodox, , and Syrian Catholics. East Syriac is used by the Assyrians (that is, Ancient ) and Chaldaeans. The two dialects are very similar with almost no difference in gram- mar and vocabulary. They differ in pronunciation and use different dialectal forms of the Syriac script. Syriac Languages. A number of modern languages and dialects employ the Syriac script in one form or another. They include the following: 1. Literary Syriac. The primary usage of Syriac script. 2. -Aramaic dialects. The Syriac script is widely used for modern Aramaic lan- guages, next to Hebrew, Cyrillic, and Latin. A number of Eastern Modern Ara- maic dialects known as Swadaya (also called vernacular Syriac, modern Syriac, modern Assyrian, and so on, and spoken mostly by the Assyrians and Chaldae- ans of Iraq, Turkey, and Iran), and the Central Aramaic dialect, Turoyo (spoken mostly by the Syrian Orthodox of the Tur Abdin region in southeast Turkey), belong to this category of languages. 3. (Arabic written in the Syriac script). It is currently used for writing Arabic liturgical texts by Syriac-speaking Christians. Garshuni employs the Arabic set of vowels and marks. 4. Christian Palestinian Aramaic (known also as Palestinian Syriac). This dialect is no longer spoken. 5. Other languages. The Syriac script was used in various historical periods for writing Armenian and some Persian dialects. Syriac speakers employed it for writing Arabic, Ottoman Turkish, and Malayalam. The Syriac Script. The Syriac script is cursive and has shaping rules that are similar to those for Arabic. The Unicode Standard does not include any presentation form characters for Syriac characters. Directionality. The Syriac script is written from right to left. Conformant implementations of Syriac script must use the Unicode bidirectional algorithm (see Section 3.12, Bidirec- tional Behavior). Syriac Type Styles. Syriac texts employ several type styles. Because all type styles use the same Syriac characters, even though their shapes vary to some extent, the Unicode Stan- dard encodes only a single Syriac script. 1. Estrangela type style. Estrangela (a word derived from Greek strongulos, mean- ing “rounded”) is the oldest type style. Ancient manuscripts use this writing style exclusively. Estrangela is used today in West and East Syriac texts for writ- ing headers, titles, and subtitles. It is the current standard in writing Syriac texts in Western scholarship.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 199 8.3 Syriac Middle Eastern Scripts

2. Serto or West Syriac type style. This type style is the most cursive of all Syriac type styles. It emerged around the eighth century and is used today in West Syr- iac texts, as well as Turoyo (Central Neo-Aramaic) and Garshuni. 3. East Syriac type style. Its early features appear as early as the sixth century; it developed into its own type style by the twelfth or thirteenth centuries. It is used today for writing East Syriac texts, as well as Swadaya (Eastern Neo-Ara- maic). It is also used today in West Syriac texts for headers, titles, and subtitles alongside the Estrangela type style. 4. Christian Palestinian Aramaic. Manuscripts of this dialect employ a script that is akin to Estrangela. It can be considered a subcategory of Estrangela. The Unicode Standard provides for usage of the type styles mentioned above. Additionally, it accommodates letters and diacritics used in Neo-Aramaic languages, Christian Palestin- ian Aramaic, and Garshuni languages. Examples are supplied in the Serto type style, except where otherwise noted. Character Names. Character names follow the East Syriac convention for naming the let- ters of the alphabet. Diacritical points use a descriptive naming—for example,   . The Syriac Abbreviation Mark. U+070F    (SAM) is a user- selectable zero-width formatting code that has no effect on the shaping process of Syriac characters. The SAM specifies the beginning point of a Syriac abbreviation, which is a line drawn horizontally above one or more characters, at the end of a word or of a group of characters followed by a character other than a Syriac letter or diacritic mark. A Syriac abbreviation may contain Syriac diacritics. Ideally, the Syriac abbreviation is rendered by a line that has a dot at each end and the cen- ter as in the examples. While not preferable, it has become acceptable for computers to ren- der the Syriac abbreviation as a line without the dots. The line is acceptable for the presentation of Syriac in plain text, but the presence of dots is recommended in liturgical texts. The Syriac abbreviation is used for letter numbers and contractions. A Syriac abbreviation generally extends from the last tall character in the word until the end of the word. A com- mon exception to this rule is found with letter numbers that are preceded by a preposition character, as seen in Figure 8-5. Figure 8-5. Syriac Abbreviation

= 15 (number in letters) = on the 15th (number with prefix) = =

200 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac

A SAM is placed before the character where the abbreviation begins. The Syriac abbrevia- tion begins over the character following the SAM and continues until the end of the word. Use of the SAM is demonstrated in Figure 8-6. Figure 8-6. Use of SAM

Backing Store: 3£K% Reversed: %K£3 Rendered:

Note: Modern East Syriac texts employ a punctuation mark for contractions of this sort. Ligatures and Combining Characters. Only one ligature is included in the Syriac block— U+071E    . This combination is used as a unique character in the same manner as an æ ligature. A number of combining diacritics unique to Syriac are encoded, but combining characters from other blocks are also used, especially from the Arabic block. Diacritic Marks and Vowels. The function of the diacritic marks varies: they indicate vow- els (like Arabic and Hebrew), mark grammatical attributes (for example, verb versus noun, interjection), or guide the reader in the pronunciation and/or reading of the given text. “The reader of the average Syriac manuscript or book, is confronted with a bewildering profusion of points. They are large, of medium size and small, arranged singly or in twos and threes, placed above the word, below it, or upon the line.” There are two vocalization systems. The first, attributed to Jacob of Edessa (633–708 ), utilizes letters derived from Greek that are placed above (or below) the characters they modify. The second is the more ancient dotted system, which employs dots in various shapes and locations to indicate vowels. East Syriac texts exclusively employ the dotted sys- tem, whereas West Syriac texts (especially later ones and in modern times) employ a mix- ture of the two systems. Diacritic marks are nonspacing and are normally centered above or below the character. Exceptions to this rule follow: 1. U+0741   and U+0742   are used only with the letters beth, gamal (in its Syriac and Garshuni forms), galath, , pe, and . • The qushshaya indicates that the letter is pronounced hard and unaspirated. • The rukkakha indicates that the letter is pronounced soft and aspirated. When the rukkakha is used in conjunction with the dalath, it is printed slightly to the right of the dalath’s dot below. 2. In Modern Syriac usage, when a word contains a rish and a seyame, the dot of the rish and the seyame are replaced by a rish with two dots above it. 3. The feminine dot is usually used to the left of a final taw. Punctuation. Most punctuation marks used with Syriac are found in the Latin-1 and Ara- bic blocks. The other ones are encoded in this block.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 201 8.3 Syriac Middle Eastern Scripts

Digits. Modern Syriac employs European numerals, as does Hebrew. The ordering of num- bers follows the same scheme as in Hebrew. Harklean Marks. The Harklean marks are used in the Harklean of the New Tes- tament. U+070B    and U+070D    mark the beginning of a phrase, word, or morpheme that has a marginal note. U+070C    marks the end of such sections. Dalath and Rish. Prior to the development of pointing, early Syriac texts did not distin- guish between a dalath and a rish, distinguished in later periods with a dot below the former and a dot above the latter. Unicode provides U+0716      as an ambiguous character. Semkath. Unlike other letters, the joining mechanism of semkath varies through the course of history from right-joining to dual-joining. It is necessary to enter a U+200C   - character after the semkath to obtain the right-joining form where required. There are two common variants of this character, U+0723    and U+0724    . They occur interchangeably in the same document, similar to the case of Greek sigma. Vowel Marks. As shown in the preceding examples, the so-called Greek vowels may be used above or below letters. As West Syriac texts employ a mixture of the Greek and dotted sys- tems, both versions are accounted for here. Miscellaneous Diacritics. Miscellaneous general diacritics are also used in Syriac text. Their usage is explained in Table 8-9. Table 8-9. Miscellaneous Diacritic Use U+0303, U+0330 These are used in Swadaya to indicate letters not found in Syriac. U+0304, U+0320 These are used for various purposes ranging from phonological to grammatical to orthographic markers. U+0307, U+0323 These points are used for various purposes—grammatical, phono- logical, and otherwise. They differ typographically and semantically from the qushshaya and rukkakha points, as well as the dotted vowel points. U+0308 This is the plural marker. It is also used in Garshuni for the Arabic teh marbuta. U+030A, U+0325 These are two other forms for the indication of qushshaya and rukka- kha. They are used interchangeably with U+0741  -  and U+0742  , especially in West Syriac grammar books. U+0324 This diacritic mark is found in ancient manuscripts. It has a gram- matical and phonological function. U+032D This is one of the digit markers. U+032E This is a mark used in late and modern East Syriac texts, as well as Swadaya, to indicate a pe.

Use of Characters of the Arabic Block. Syriac makes use of several characters from the Ara- bic block, including tatweel (U+0640). Modern texts use U+060C  , U+061B  , and U+061F   . The shadda (U+0651) is also used in the core part of literary Syriac on top of a waw in the word “O”. Arabic har- akat are used in Garshuni to indicate the corresponding Arabic vowels and diacritics.

202 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac

Syriac Shaping Minimum Rendering Requirements. Rendering requirements for Syriac are similar to those for Arabic. The remainder of this section specifies a minimum set of rules that pro- vides legible Syriac joining and ligature substitution behavior. Joining Classes. Each Syriac character is represented by up to four possible contextual glyph forms. The form used is determined by its joining class and the joining class of the letter on each side. These classes are identical in behavior to those outlined for Arabic, with the addition of three extra classes that determine the behavior of final alaphs. See Table 8-10. Table 8-10. Additional Syriac Joining Classes

Joining Class Description Afj Final joining (alaph only) Afn Final non-joining except following dalath and rish (alaph only) Afx Final non-joining following dalath and rish (alaph only)

R1 An alaph that has a left-joining character to its right and a word-breaking charac- ter to its left will take the form of Afj.

R2 An alaph that has a non-left-joining character to its right, except for a dalath or rish, and a word-breaking character to its left will take the form of Afn.

R3 An alaph that has a dalath or rish to its right and a word-breaking character to its left will take the form of Afx.

The above example is in the East Syriac font style.

Syriac Cursive Joining Table 8-11, Table 8-12, and Table 8-13 provide listings of how each character is shaped in the appropriate joining type. Syriac characters not shown are non-joining. The shaping classes are included in ArabicShaping.txt on the CD-ROM. Table 8-11. Right-Joining Syriac Characters

Character Xn Xr

DALATH 

DOTLESS DALATH RISH 

HE 

WAW

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 203 8.3 Syriac Middle Eastern Scripts

Table 8-11. Right-Joining Syriac Characters (Continued)

Character Xn Xr

ZAIN

YUDH HE 

SADHE 

RISH 

TAW 

Table 8-12. Dual-Joining Syriac Characters

Character Xn Xr Xm Xl BETH 

GAMAL 

GAMAL GARSHUNI  !

HETH "#$%

TETH &'()

TETH GARSHUNI *+,-

YUDH +/01

KAPH 2345

LAMADH 67 8 9

MIM :;<=

NUN >?@A

SEMKATH BCDE

SEMKATH FINAL FGHI

E JKLM

PE NOPQ

PE REVERSED RSTU

QAPH VWXY

SHIN Z[\]

204 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac

Table 8-13. Alaph-Joining Syriac Characters

Type Style An Ar Afj Afn Afx Estrangela hiihh

Serto (West Syriac) ^__``

East Syriac 3 ag43

Ligatures Ligature Classes. As in other scripts, ligatures in Syriac vary depending upon the font style. Table 8-14 identifies the principal valid ligatures for each font style. When applicable, these ligatures are obligatory, unless denoted with an (*). Table 8-14. Syriac Ligatures

Serto (West Characters Estrangela East Syriac Sources Syriac) ALAPH LAMADH N/A Dual-joining N/A Beth Gazo GAMAL LAMADH N/A Dual-joining* N/A Armalah GAMAL E N/A Dual-joining* N/A Armalah HE YUDH N/A N/A Right-joining* Qdom YUDH TAW N/A Right-joining* N/A Armalah* KAPH LAMADH N/A Dual-joining* N/A Shhimo KAPH TAW N/A Right-joining* N/A Armalah LAMADH SPACE ALAPH N/A Right-joining* N/A Nomocanon LAMADH ALAPH Right-joining* Right-joining Right-joining* BFBS LAMADH LAMADH N/A Dual-joining* N/A Shhimo ALAPH N/A Right-joining* N/A Shhimo SEMKATH TETH N/A Dual-joining* N/A Qurobo SADHE NUN Right-joining* Right-joining* Right-joining* Mushhotho RISH SEYAME Right-joining Right-joining Right-joining BFBS TAW ALAPH Right-joining* N/A Right-joining* Qdom TAW YUDH N/A N/A Right-joining*

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 205 8.4 Thaana Middle Eastern Scripts

8.4 Thaana

Thaana: U+0780–U+07BF The Thaana script is used to write the modern Dhivehi language of the Republic of Maldives, a group of atolls in the Indian Ocean. Like the Arabic script, Thaana is written from right to left and uses vowel signs, but it is not cursive. The basic Thaana letters have been extended by a small set of dotted letters used to transcribe Arabic. The use of modified Thaana letters to write Arabic dates from the middle of the twentieth century. Loan words from Arabic may be written in the Arabic script, although this custom is not very prevalent today. (See Section 8.2, Arabic.) Directionality. The Thaana script is written from right to left. Conformant implementa- tions of Thaana script must use the Unicode bidirectional algorithm (see Section 3.12, Bidi- rectional Behavior). Vowels. Consonants are always written with either a vowel sign (U+07A6..U+07AF) or the null vowel sign (U+07B0  ). U+0787    with the null vowel sign denotes a glottal stop. Numerals. Both European (U+0030..U+0039) and Arabic digits (U+0660..U+0669) are used. European numbers are used more commonly, and have left-to-right display direc- tionality in Thaana. Arabic numeric punctuation is used with digits, whether Arabic or European. Punctuation. The Thaana script uses spaces between words. It makes use of a mixture of Arabic and European punctuation, with rules of usage not clearly defined. Sentence-final punctuation is now generally shown with a single period (U+002E “.”  ), but may also use a sequence of two periods (U+002E followed by U+002E). Phrases may be sepa- rated with a comma (usually U+060C  ) or with a single period (U+002E). Colons, dashes, and double quotation marks are also used in the Thaana script. In addi- tion, Thaana makes use of U+061F    and U+061B  - . Character Names and Arrangement. The character names are based on the names used in the Republic of Maldives. The character name at U+0794, yaa, is found in some sources as yaviyani, but the former is more common today. Characters are listed in Thaana alphabet- ical order from haa to ttaa for the Thaana letters, followed by the extended characters in Arabic alphabetical order from hhaa to waavu.

206 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 9 South and Southeast Asian Scripts 9

The following scripts are encoded in this group: • Devanagari •Bengali •Gurmukhi •Gujarati •Oriya •Tamil •Telugu •Kannada • Malayalam •Sinhala •Thai •Lao •Tibetan •Myanmar •Khmer The scripts of South and Southeast Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letter- forms. With minor historical exceptions, they are written from left to right; many use no interword spacing but use spaces or marks between phrases. They are all essentially alpha- betic scripts in which most symbols stand for a consonant plus an (usually the sound “ah”). Word-initial vowels in many of these scripts have special symbols, but word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. Most of the scripts of South and Southeast Asia, from the in the north to Sri Lanka in the south, from Pakistan in the west to the eastern-most islands of Indonesia, are derived from the ancient . The oldest lengthy inscriptions of India, the Asoka edicts of the third century , were written in two scripts, and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 209 South and Southeast Asian Scripts important administrative language of the Middle East at that time. Kharosthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendents of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of San- skrit literature and the archetypal representative of the northern branch of the script fam- ily. This northern branch includes such modern scripts as Myanmar (Burmese), Gurmukhi, and Tibetan; the southern branch includes scripts such as Malayalam. The major scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based upon the Indian national standard (ISCII) encoding for these scripts. Sinhala, Myanmar, and Khmer have virama-based models that do not map to ISCII. Thai and Lao share a different model, based on the Thai Industrial Standard encoding for Thai, which uses visual order- ing and is not laid out compatibly to ISCII. Tibetan stands apart from all of these models, reflecting its somewhat different structure and usage. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.

210 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari

9.1 Devanagari

Devanagari: U+0900–U+097F The Devanagari script is used for writing classical and its modern historical deriv- ative, Hindi. Extensions to Devanagari are used to write other related (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattis- garhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali. All other Indic scripts, as well as the of Sri Lanka, the Tibetan script, and the Southeast Asian scripts (Thai, Lao, Khmer, and Myanmar), are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate. Standards. The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Standard Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative position as those coded in positions A0–F416 in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Guja- rati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding posi- tions in different scripts in the family. Sinhala, Thai, Lao, Khmer, and Myanmar depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order. In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII- 1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a super- set of the ISCII-1991 repertoire except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G—Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code values and back to their original encoding without loss of information. Encoding Principles. The writing systems that employ Devanagari and other Indic scripts constitute a cross between syllabic writing systems and phonemic writing systems (alpha- bets). The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV ) core and, optionally, one or more preceding consonants, with a canonical structure of ((C)C)CV. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 211 9.1 Devanagari South and Southeast Asian Scripts

The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devana- gari script. These pieces consist of three distinct character types: consonant letters, inde- pendent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order.

Principles of the Script Rendering Devanagari Characters. Devanagari characters, like characters from many other scripts, can combine or change shape depending on their context. A character’s appearance is affected by its ordering with respect to other characters, the font used to ren- der the character, and the application or system environment. These variables can cause the appearance of Devanagari characters to differ from their nominal glyphs (used in the code charts). Additionally, a few Devanagari characters cause a change in the order of the displayed char- acters. This reordering is not commonly seen in non-Indic scripts and occurs indepen- dently of any bidirectional character reordering that might be required. Consonant Letters. Each consonant letter represents a single consonantal sound but also has the peculiarity of having an inherent vowel, generally the short vowel /a/ in Devanagari and the other Indic scripts. Thus U+0915    represents not just /k/ but also /ka/. In the presence of a dependent vowel, however, the inherent vowel associated with a consonant letter is overridden by the dependent vowel. Consonant letters may also be rendered as half-forms, which are presentation forms used to depict the initial consonant in consonant clusters. These half-forms do not have an inher- ent vowel. Their rendered forms in Devanagari often resemble the full consonant but are missing the vertical stem, which marks a syllabic core. (The stem glyph is graphically and historically related to the sign denoting the inherent /a/ vowel.) Some Devanagari consonant letters have alternative presentation forms whose choice depends upon neighboring consonants. This variability is especially notable for U+0930   , which has numerous different forms, both as the initial element and as the final element of a consonant cluster. Only the nominal forms, rather than the contextual alternatives, are depicted in the code chart. The traditional Sanskrit/Devanagari alphabetic encoding order for consonants follows articulatory phonetic principles, starting with velar consonants and moving forward to bilabial consonants, followed by liquids and then . ISCII and the Unicode Stan- dard both observe this traditional order. Independent Vowel Letters. The independent vowels in Devanagari are letters that stand on their own. The writing system treats independent vowels as orthographic CV syllables in which the consonant is null. The independent vowel letters are used to write syllables that start with a vowel. Dependent Vowel Signs (Matras). The dependent vowels serve as the common manner of writing noninherent vowels and are generally referred to as vowel signs, or as matras in San- skrit. The dependent vowels do not stand alone; rather, they are visibly depicted in combi- nation with a base letterform. A single consonant, or a consonant cluster, may have a dependent vowel applied to it to indicate the vowel quality of the syllable, when it is differ- ent from the inherent vowel. Explicit appearance of a dependent vowel in a syllable over- rides the inherent vowel of a single consonant letter. The greatest variation among different Indic scripts is found in the way that the dependent vowels are applied to base letterforms. Devanagari has a collection of nonspacing depen- dent vowel signs that may appear above or below a consonant letter, as well as spacing dependent vowel signs that may occur to the right or to the left of a consonant letter or

212 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari consonant cluster. Other Indic scripts generally have one or more of these forms, but what is a nonspacing mark in one script may be a spacing mark in another. Also, some of the Indic scripts have single dependent vowels that are indicated by two or more glyph compo- nents—and those glyph components may surround a consonant letter both to the left and right or may occur both above and below it. The Devanagari script has only one character denoting a left-side dependent vowel sign: U+093F    . Other Indic scripts either have no such vowel signs (Telugu and Kannada) or include as many as three of these signs (Bengali, Tamil, and Malayalam). A one-to-one correspondence exists between the independent vowels and the dependent vowel signs. Independent vowels are sometimes represented by a sequence consisting of the independent form of the vowel /a/ followed by a dependent vowel sign. For example, Figure 9-1 illustrates this relationship (see the notation formally described in the “Rules for Rendering” later in this section). Figure 9-1. Dependent Versus Independent Vowels

/a/ + Dependent Vowel Independent Vowel

+→+ ≈ An Ivs Ivs An In ŠÌ+ → ̊ ≈ Œ +→+ ≈ An Uvs An Uvs Un Š + Î → ŠÎ ≈ Ž

The combination of the independent form of the default vowel /a/ (in the Devanagari script, U+0905   ) with a dependent vowel sign may be viewed as an alternative spelling of the phonetic information normally represented by an isolated inde- pendent vowel form. However, these two representations should not be considered equiva- lent for the purposes of rendering. Higher-level text processes may choose to consider these alternative spellings equivalent in terms of information content, but such an equivalence is not stipulated by this standard. Virama. Devanagari and other Indic scripts employ a sign known as the virama, halant, or vowel omission sign. A virama sign (for example, U+094D   ) nominally serves to cancel (or kill) the inherent vowel of the consonant to which it is applied. The virama functions as a combining character, with its shape varying from script to script. When a consonant has lost its inherent vowel by the application of virama, it is known as a dead consonant; in contrast, a live consonant is one that retains its inherent vowel or is written with an explicit dependent vowel sign. In the Unicode Standard, a dead consonant is defined as a sequence consisting of a consonant letter followed by a virama. The default rendering for a dead consonant is to position the virama as a combining mark bound to the consonant letterform.

For example, if Cn denotes the nominal form of consonant C, and Cd denotes the dead consonant form, then a dead consonant is encoded as shown in Figure 9-2. Consonant Conjuncts. The Indic scripts are noted for a large number of consonant con- junct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 213 9.1 Devanagari South and Southeast Asian Scripts

Figure 9-2. Dead Consonants

+ → TAn VIRAMAn TAd ±±+→Ü Ü letterforms. This abbreviation takes place only in the context of a consonant cluster. An orthographic consonant cluster is defined as a sequence of characters that represents one or more dead consonants (denoted Cd) followed by a normal, live consonant letter (denoted Cl) or an independent vowel letter. Under normal circumstances, a consonant cluster is depicted with a conjunct glyph if such a glyph is available in the current font(s). In the absence of a conjunct glyph, the one or more dead consonants that form part of the cluster are depicted using half-form glyphs. In the absence of half-form glyphs, the dead consonants are depicted using the nominal con- sonant forms combined with visible virama signs (see Figure 9-3). Figure 9-3. Conjunct Formations

₍₎ + → + ₍₎ + → GAd DHAl GAh DHAn KAd SSHAl K.SSHAn Ü +→´ó´ œÜ +→ÆS

₍₎ + → ₍₎ + → + KAd KAl K.KAn RAd RIn RIn RAsup œÜ +→œP ¿Ü +→F

A number of types of conjunct formations appear in these examples: (1) a half-form of GA in its combination with the full form of DHA; (2) a vertical conjunct K.KA; (3) a fully ligated conjunct K.SSHA, in which the components are no longer distinct; and (4) a rare conjunct formed with an independent vowel letter, in this case the vowel letter (also known as vocalic r). Note that in example (4) in Figure 9-3, the dead consonant RAd is depicted with the nonspacing combining mark RAsup (repha). A well-designed Indic script font may contain hundreds of conjunct glyphs, but they are not encoded as Unicode characters because they are the result of ligation of distinct letters. Indic script rendering software must be able to map appropriate combinations of charac- ters in context to the appropriate conjunct glyphs in fonts. When an independent vowel appears as the element of a consonant cluster, as in example (4) in Figure 9-3, the independent vowel should not be depicted as a dependent vowel sign, but as an independent vowel letterform. Explicit Virama. Normally a virama character serves to create dead consonants that are, in turn, combined with subsequent consonants to form conjuncts. This behavior usually results in a virama sign not being depicted visually. Occasionally, however, this default behavior is not desired when a dead consonant should be excluded from conjunct forma- tion, in which case the virama sign is visibly rendered. To accomplish this goal, the Unicode Standard adopts the convention of placing the character U+200C   - immediately after the encoded dead consonant that is to be excluded from conjunct

214 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari formation. In this case, the virama sign is always depicted as appropriate for the consonant to which it is attached. For example, in Figure 9-4,S the use of   - prevents the default forma- tion of the conjunct form (K.SSHAn). Figure 9-4. Preventing Conjunct Forms

+ + → + KAd ZWNJ SSHAl KAd SSHAn

œÜ ++ZW Æ → œÆÜ NJ

Explicit Half-Consonants. When a dead consonant participates in forming a conjunct, the dead consonant form is often absorbed into the conjunct form, such that it is no longer dis- tinctly visible. In other contexts, however, the dead consonant may remain visible as a half- consonant form. In general, a half-consonant form is distinguished from the nominal consonant form by the loss of its inherent vowel stem, a vertical stem appearing to the right side of the consonant form. In other cases, the vertical stem remains but some part of its right-side geometry is missing. In certain cases, it is desirable to prevent a dead consonant from assuming full conjunct formation yet still not appear with an explicit virama. In these cases, the half-form of the consonant is used. To explicitly encode a half-consonant form, the Unicode Standard adopts the convention of placing the character U+200D    immediately after the encoded dead consonant. The    denotes a nonvisible letter that presents linking or cursive joining behavior on either side (that is, to the previous or fol- lowing letter). Therefore, in the present context, the    may be consid- ered to present a context to which a preceding dead consonant may join so as to create the half-form of the consonant.

For example, if Ch denotes the half-form glyph of consonant C, then a half-consonant form is encoded as shown in Figure 9-5. Figure 9-5. Half-Consonants

+ + → + KAd ZWJ SSHAl KAh SSHAn

œÜ ++ZW Æ → þÆ J

• In the absence of the  S , this sequence would normally pro- duce the full conjunct form (K.SSHAn). This encoding of half-consonant forms also applies in the absence of a base letterform. That is, this technique may also be used to encode independent half-forms, as shown in Figure 9-6. Consonant Forms. In summary, each consonant may be encoded such that it denotes a live consonant, a dead consonant that may be absorbed into a conjunct, or the half-form of a dead consonant (see Figure 9-7).

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 215 9.1 Devanagari South and Southeast Asian Scripts

Figure 9-6. Independent Half-Forms

+ → GAd ZWJ GAh

Ü + ZW → ó J

Figure 9-7. Consonant Forms → œœKAl

+→ œœÜ Ü KAd

œ +→Ü + ZW þ J KAh

Rendering Rules for Rendering. The following provides more formal and detailed rules for minimal rendering of Devanagari as part of a plain text sequence. It describes the mapping between Unicode characters and the glyphs in a Devanagari font. It also describes the combining and ordering of those glyphs. These rules provide minimal requirements for legibly rendering interchanged Devanagari text. As with any script, a more complex procedure can add rendering characteristics, depending on the font and application. It is important to emphasize that in a font that is capable of rendering Devanagari, the set of glyphs is greater than the number of Devanagari Unicode characters. Notation. In the next set of rules, the following notation applies:

Cn Nominal glyph form of consonant C as it appears in the code charts.

Cl A live consonant, depicted identically to Cn.

Cd Glyph depicting the dead consonant form of consonant C.

Ch Glyph depicting the half-consonant form of consonant C.

Ln Nominal glyph form of a conjunct ligature consisting of two or more component consonants. A conjunct ligature composed of two consonants X and Y is also denoted X..

RAsup A nonspacing combining mark glyph form of the U+0930    positioned above or attached to the upper part of a base glyph form. This form is also known as repha.

RAsub A nonspacing combining mark glyph form of the U+0930    positioned below or attached to the lower part of a base glyph form.

Vvs Glyph depicting the dependent vowel sign form of a vowel V.

216 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari

VIRAMAn The nominal glyph form nonspacing combining mark depict- ing U+094D   . • A virama character is not always depicted; when it is depicted, it adopts this nonspacing mark form. Dead Consonant Rule. The following rule logically precedes the application of any other rule to form a dead consonant. Once formed, a dead consonant may be subject to other rules described next.

R1 When a consonant Cn precedes a VIRAMAn , it is considered to be a dead conso- nant Cd . A consonant Cn that does not precede VIRAMAn is considered to be a live consonant Cl . + → TAn VIRAMAn TAd ±±+→Ü Ü

Consonant RA Rules. The character U+0930    takes one of a num- ber of visual forms depending on its context in a consonant cluster. By default, this letter is depicted with its nominal glyph form (as shown in the code charts). In two contexts, it is depicted using a nonspacing glyph form that combines with a base letterform.

R2 If the dead consonant RAd precedes either a consonant or an independent vowel, then it is replaced by the superscript nonspacing mark RAsup , which is positioned so that it applies to the logically subsequent element in the memory representation. + → + Displayed RAd KAl KAl RAsup Output ¿Ü +→œ œ + F → œF

12+ → 2 + 1 RAd RAd RAd RAsup ¿Ü +→¿Ü ¿Ü + F → ¿Ü F

R3 If the superscript mark RAsup is to be applied to a dead consonant and that dead consonant is combined with another consonant to form a conjunct ligature, then the mark is positioned so that it applies to the conjunct ligature form as a whole. + + → + Displayed RAd JAd NYAl J.NYAn RAsup Output ¿Ü +→¦Ü + © “ + F → “ F

R4 If the superscript mark RAsup is to be applied to a dead consonant that is subse- quently replaced by its half-consonant form, then the mark is positioned so that it applies to the form that serves as the base of the consonant cluster. + + → + + Displayed RAd GAd GHAl GAh GHAl RAsup Output ¿Ü +→ Ü + ¢ ó + ¢ + F → ó¢ F

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 217 9.1 Devanagari South and Southeast Asian Scripts

R5 In conformance with the ISCII standard, the half-consonant form RRAh is repre- sented as eyelash-RA. This form of RA is commonly used in writing Marathi. + → RRAn VIRAMAn RRAh ¿ +→Ü :

R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant    RAd precedes , then the half -consonant form RAh , depicted as eyelash-RA, is used instead of RAsup . + → RAd ZWJ RAh

¿Ü + ZW → : J

R6 Except for the dead consonant RAd , when a dead consonant Cd precedes the live consonant RAl , then Cd is replaced with its nominal form Cn , and RA is replaced by the subscript nonspacing mark RAsub , which is positioned so that it applies to Cn .

THA + RA → THA + RA Displayed d l n sub Output «Ü +→¿ « + ã → «ã

R7 For certain consonants, the mark RAsub may graphically combine with the conso- nant to form a conjunct ligature form. These combinations, such as the one shown here, are further addressed by the ligature rules described shortly.

PHA + RA → PHA + RA Displayed d l n sub Output ¸Ü +→¿ ¸ + ã → p

R8 If a dead consonant (other than RAd ) precedes RAd , then the substitution of RA for RAsub is performed as described above; however, the VIRAMA that formed RAd remains so as to form a dead consonant conjunct form. + → + + → TAd RAd TAn RAsub VIRAMAn T.RA d ±Ü +→¿Ü ± + ã + Ü → dÜ

A dead consonant conjunct form that contains an absorbed RAd may subsequently combine to form a multipart conjunct form. + → T.RA d YAl T.R.YA n dÜ +→½î½

218 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari

Modifier Mark Rules. In addition to vowel signs, three other types of combining marks may be applied to a component of an orthographic syllable or to the syllable as a whole: nukta, bindus, and svaras. R9 The nukta sign, which modifies a consonant form, is placed immediately after the consonant in the memory representation and is attached to that consonant in ren- dering. If the consonant represents a dead consonant, then NUKTA should precede VIRAMA in the memory representation. + + → KAn NUKTAn VIRAMAn QAd œ +→ + Ü œÜ

R10 The other modifying marks, bindus and svaras, apply to the orthographic syllable as a whole and should follow (in the memory representation) all other characters that constitute the syllable. In particular, the bindus should follow any vowel signs, and the svaras should come last. The relative placement of these marks is horizontal rather than vertical; the horizontal rendering order may vary according to typo- graphic concerns. + + KAn AAvs CANDRABINDUn œ +→Ë + † œË†

Ligature Rules. Subsequent to the application of the rules just described, a set of rules gov- erning ligature formation apply. The precise application of these rules depends on the availability of glyphs in the current font(s) being used to display the text. R11 If a dead consonant immediately precedes another dead consonant or a live conso- nant, then the first dead consonant may join the subsequent element to form a two- part conjunct ligature form. + → + → JAd NYAl J.NYAn TTAd TTHAl TT.TTHAn ¦Ü + © → “ ªÜ + « → _

R12 A conjunct ligature form can itself behave as a dead consonant and enter into further, more complex ligatures. + + → + → SAd TAd RAn SAd T.R.A n S.T.RAn ÇÜ + ±Ü + ¿ → ÇÜ + d → „

A conjunct ligature form can also produce a half-form. + → + K.SSHAd YAl K.SSHh YAn Sð½Ü + ½ →

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 219 9.1 Devanagari South and Southeast Asian Scripts

R13 If a nominal consonant or conjunct ligature form precedes RAsub as a result of the application of rule R2, then the consonant or ligature form may join with RAsub to form a multipart conjunct ligature (see rule R2 for more information). + → + → KAn RAsub K.RAn PHAn RAsub PH.RAn œ + ã → R ¸ + ã → p

R14 In some cases, other combining marks will also combine with a base consonant, either attaching at a nonstandard location or changing shape. In minimal rendering there are only two cases, RAl with Uvs or UUvs . + → + → RAl Uvs RUn RAl UUvs RUUn ¿ + G → L ¿ + H → M

Memory Representation and Rendering Order. The order for storage of plain text in Devanagari and all other Indic scripts generally follows phonetic order; that is, a CV sylla- ble with a dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in the memory representation. This order is employed by the ISCII standard and corresponds with both the phonetic and keying order of textual data (see Figure 9-8). Figure 9-8. Rendering Order

Character Order Glyph Order + → + KAn Ivs Ivs KAn œÌœ+→Ì

Because Devanagari and other Indic scripts have some dependent vowels that must be depicted to the left side of their consonant letter, the software that renders the Indic scripts must be able to reorder elements in mapping from the logical (character) store to the pre- sentational (glyph) rendering. For example, if Cn denotes the nominal form of consonant C, and Vvs denotes a left-side dependent vowel sign form of vowel V, then a reordering of glyphs with respect to encoded characters occurs as just shown.

R15 When the dependent vowel Ivs is used to override the inherent vowel of a syllable, it is always written to the extreme left of the orthographic syllable. If the orthographic syl- lable contains a consonant cluster, then this vowel is always depicted to the left of that cluster. For example: + + → + → + TAd RAl Ivs T.RA n Ivs Ivs T.RA d ±Ü +→+→¿ + Ì d Ì Ìd

Sample Half-Forms. Table 9-1 shows examples of half-consonant forms that are com- monly used with the Devanagari script. These forms are glyphs, not characters. They may be encoded explicitly using    as shown; in normal conjunct formation, they may be used spontaneously to depict a dead consonant in combination with subse- quent consonant forms.

220 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari

Table 9-1. Sample Half-Forms

œþµüÜ ZW Ü ZW J J ž€·åÜ ZW Ü ZW J J ó¸Ü ZW Ü ZW J J ¢ìºêÜ ZW Ü ZW J J ¤ö»‚Ü ZW Ü ZW J J ¦ú¼ƒÜ ZW Ü ZW J J ¨½ñÜ ZW Œ Ü ZW J J ©÷ÁòÜ ZW Ü ZW J J °øÄôÜ ZW Ü ZW J J ±íÅõÜ ZW Ü ZW J J ²ûÆùÜ ZW Ü ZW J J ´çÇïÜ ZW Ü ZW J J

Sample Ligatures. Table 9-2 shows examples of conjunct ligature forms that are commonly used with the Devanagari script. These forms are glyphs, not characters. Not every writing system that employs this script uses all of these forms; in particular, many of these forms are used only in writing Sanskrit texts. Furthermore, individual fonts may provide fewer or more ligature forms than are depicted here. Table 9-2. Sample Ligatures œœPª«_Ü Ü œ±Q««nÜ Ü œ¿R¬ `Ü Ü œÆS¬¬aÜ Ü £œV¬®bÜ Ü £žW±Ü Ü ±c £ X±Ü Ü ¿d £¢Yµµ¾Ü Ü ©¦‘¸¿pÜ Ü ¦©“ÅÜ Ü ¿o ³¢fȼrÜ Ü ³³gȽsÜ Ü ³´hÈÁtÜ Ü

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 221 9.1 Devanagari South and Southeast Asian Scripts

Table 9-2. Sample Ligatures (Continued) ³ºiÈÄuÜ Ü ³»jÈÜ C N ³¼k¿Ü A L ³½l¿Ü B M ³ÄmÇd„Ü Ü ªª^Ü

Sample Half-Ligature Forms. In addition to half-form glyphs of individual consonants, half-forms are also used to depict conjunct ligature forms. A sample of such forms is shown in Table 9-3. These forms are glyphs, not characters. They may be encoded explicitly using    as shown; in normal conjunct formation, they may be used spontane- ously to depict a conjunct ligature in combination with subsequent consonant forms. Table 9-3. Sample Half-Ligature Forms

œÆÜ Ü ZW ð J ¦©Ü Ü ZW J ±±Ü Ü ZW ë J ±¿Ü Ü ZW î J Å¿Ü Ü ZW é J

Combining Marks. Devanagari and other Indic scripts have a number of combining marks that could be considered diacritic. One class of these marks, known as bindus, is repre- sented by U+0901    and U+0902   - . These marks indicate or final nasal closure of a syllable. U+093C    is a true diacritic. It is used to extend the basic set of consonant letters by modifying them (with a subscript dot in Devanagari) to create new letters. U+0951..U+0954 are a set of combining marks used in transcription of Sanskrit texts. Digits. Each Indic script has a distinct set of digits appropriate to that script. These digits may or may not be used in ordinary text in that script. European digits have displaced the Indic script forms in modern usage in many of the scripts. Some Indic scripts—notably Tamil—lack a distinct digit for zero. Punctuation and Symbols. U+0964   is similar to a full stop. Corre- sponding forms occur in many other Indic scripts. U+0965    marks the end of a verse in traditional texts. Many modern languages written in the Devanagari script intersperse punctuation derived from the Latin script. Thus U+002C  and U+002E   are freely used in writing Hindi, and the danda is usually restricted to more traditional texts. Encoding Structure. The Unicode Standard organizes the nine principal Indic scripts in blocks of 128 encoding points each. The first six columns in each script are isomorphic with the ISCII-1988 encoding, except that the last 11 positions (U+0955..U+095F in

222 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari

Devanagari, for example), which are unassigned or undefined in ISCII-1988, are used in the Unicode encoding. The seventh column in each of these scripts, along with the last 11 positions in the sixth column, represent additional character assignments in the Unicode Standard that are matched across all nine scripts. For example, positions U+xx66..U+xx6F and U+xxE6.. U+xxEF code the Indic script digits for each script. The eighth column for each script is reserved for script-specific additions that do not cor- respond from one Indic script to the next.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 223 9.2 Bengali South and Southeast Asian Scripts

9.2 Bengali

Bengali: U+0980–U+09FF The Bengali script is a North Indian script closely related to Devanagari. It is used to write the Bengali language primarily in West Bengal state (India) and in the nation of Ban- gladesh. It is also used to write Assamese in Assam (India) and a number of other minority languages (Daphla, Garo, Hallam, Khasi, Manipuri, Mizo, Munda, Naga, Rian, and Santali) in northeastern India. Two-Part Vowel Signs. The Bengali script, along with a number of other Indic scripts, makes use of two-part vowel signs; in these vowels one-half of the vowel is placed on each side of a consonant letter or cluster—for example, U+09CB     and U+09CC    . The vowel signs are coded in each case in the position in the charts isomorphic with the corresponding vowel in Devanagari. Hence U+09CC -     is isomorphic with U+094C    . To provide compatibility with existing implementations of the scripts that use two-part vowel signs, the Unicode Standard explicitly encodes the right half of these vowel signs; for example, U+09D7     represents the right-half glyph component of U+09CC    . Special Characters. U+09F2..U+09F9 are a series of Bengali additions for writing currency and fractions. Rendering Behavior. For rendering of the Bengali script, see the rules for rendering in Section 9.1, Devanagari.

224 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.3 Gurmukhi

9.3 Gurmukhi

Gurmukhi: U+0A00–U+0A7F The Gurmukhi script is a North Indian script historically derived from an older script called Lahnda. It is quite closely related to Devanagari structurally. Gurmukhi is used to write the Punjabi language in the Punjab in India. Rendering Behavior. For rendering of the Gurmukhi script, see the rules for rendering in Section 9.1, Devanagari.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 225 9.4 Gujarati South and Southeast Asian Scripts

9.4 Gujarati

Gujarati: U+0A80–U+0AFF The is a North Indian script closely related to Devanagari. It is most obvi- ously distinguished from Devanagari by not having a horizontal bar for its letterforms, a characteristic of the older script to which Gujarati is related. The Gujarati script is used to write the Gujarati language of the Gujarat state in India. Rendering Behavior. For rendering of the Gujarati script, see the rules for rendering in Section 9.1, Devanagari.

226 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.5 Oriya

9.5 Oriya

Oriya: U+0B00–U+0B7F The Oriya script is a North Indian script structurally similar to Devanagari, but with semi- circular lines at the top of most letters instead of the straight horizontal bars of Devanagari. The actual shapes of the letters, particularly for vowel signs, show similarities to Tamil. The Oriya script is used to write the Oriya language, of Orissa state, India, as well as minority languages such as Khondi and Santali. Special Characters. U+0B57     is provided as an encoding for the right side of the surroundrant vowel U+0B4C     . Rendering Behavior. For rendering of the Oriya script, see the rules for rendering in Section 9.1, Devanagari.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 227 9.6 Tamil South and Southeast Asian Scripts

9.6 Tamil

Tamil: U+0B80–U+0BFF The is a South Indian script. South Indian scripts are structurally related to the North Indian scripts, but they are used to write the of southern India and of Sri Lanka, which are genetically unrelated to the North Indian languages such as Hindi, Bengali, and Gujarati. The shapes of letters in the South Indian scripts are generally quite distinct from the shapes of letters in Devanagari and its related scripts. This difference is partly a result of the fact that the South Indian scripts were originally carved with needles on palm leaves, a technology that apparently favored rounded letter shapes rather than square, blocklike shapes. The Tamil script is used to write the of Tamil Nadu state in India as well as minority languages such as Badaga. Tamil is also used in Sri Lanka, Singapore, and parts of Malaysia. This script has fewer consonants than the other Indic scripts. It also lacks con- junct consonant forms. Instead of conjunct consonant forms, the virama (U+0BCD) is normally fully depicted in Tamil text. Naming Conventions for Mid Vowels. The Unicode character encoding for Tamil uses a distinct set of naming conventions for mid vowels in the South Indian (Dravidian) scripts. These conventions are illustrated by U+0B8E    and U+0B8F   , to be contrasted with the isomorphic positions in Devanagari: U+090E     and U+090F   . The Dravidian languages have a reg- ular length distinction in the mid vowels that is not reflected in normal Devanagari. U+090E     is an addition to Devanagari to enable transcription of the Dravidian short vowel forms. The naming conventions are chosen to best reflect the actual nature of the vowels in question in the Dravidian scripts, as well as in Devanagari and the other North Indian scripts. Special Characters. U+0BD7     is provided as an encoding for the right side of the surroundrant (or two-part) vowel U+0BCC    . Rendering of Tamil Script. The South Indic scripts function in much the same way as Devanagari, with the additional feature of two-part vowels. As in the Devanagari example, the words “ ” and “  ” will be omitted where this deletion does not cause ambiguity. It is important to emphasize that in a font that is capable of rendering Tamil, the set of glyphs is greater than the number of Tamil characters. Table 9-4 is a summary of the Tamil letters.

228 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.6 Tamil

Table 9-4. Tamil Letter Summary šžŸ¡£¤¨©­®¯ KA NGA CA JA NYA TTA NNA TA NA NNNA PA ³´µ¶·¸¹º¼ ½¾ MA YA RA RRA LA LLA LLLA SSA SA HA Š‹ŒŽ’“”–—˜ A AA I II U UU E EE AI O OO AU Ãw þÅ þ|ý Ü þ|æ ÝËþ Ìþ Íþ ËþÃÌþÃËþ¸ A AA I II U UU E EE AI O OO AU þ | ¸ VIRAMA AU LENGTH

Independent Versus Dependent Vowels. As with Devanagari, the dependent vowel signs are not equivalent to a sequence of virama + independent vowel. For example: ® + w ≠ ® + þ| + Œ

As in the case of Devanagari, a consonant cluster is any sequence of one or more consonants separated by viramas, possibly terminated with a virama. Two-Part Vowels. Certain Indic vowels consist of two discontiguous elements. As in other cases of discontiguous elements, two sequences of Unicode values can be used to express equivalent spellings. This representation is similar to the case of letters such as “â”, which can be spelled either with “a” followed by a nonspacing “ ”ó or with a single Unicode char- acter “â”.

ËþÃ (0BCA) ≈ Ëþ + Ã (0BC6 + 0BBE) ÌþÃ (0BCB) ≈ Ìþ + Ã (0BC7 + 0BBE)

Ëþ¸ (0BCC) ≈ Ëþ + ¸ (0BC7 + 0BD7)

Note that the ¸ in the third example is not U+0BB3   ; it is U+0BD7    . If the precomposed forms are used in the memory representation instead of the separate characters, then a similar transformation occurs in the rendering process. The precom- posed form on the left is transformed into the two separate forms equivalent to those on the right, which are then subject to vowel reordering, as shown below. Thus in rendering: ËþÃ → Ëþ + Ã ÌþÃ → Ìþ + Ã Ëþ¸ → Ëþ + ¸

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 229 9.6 Tamil South and Southeast Asian Scripts

Vowel Reordering. As shown in Table 9-5, the following vowels are always reordered in front of the previous consonant cluster, similar to the rendering behavior of the -    :

Ëþ (0BC6) Ìþ (0BC7) Íþ (0BC8)

Table 9-5. Vowel Reordering

Memory Representation Display šËþ → ˚ šÌþ → ̚ šÍþ → ͚

The same effect occurs with the results of vowel splitting (see Ta ble 9-6). Table 9-6. Vowel Splitting and Reordering

Memory Representation Display šËþà → ˚à šËþ à → ˚à šÌþà → ̚à šÌþ à → ̚Ã

šËþ¸ → ˚¸

šËþ ¸ → ˚¸

In both cases, the ordering of the elements is unambiguous: the consonant (cluster) occurs first in the memory representation. The vowel ˜ also has two discontinuous parts and can be composed using the   . Ligatures. The following examples illustrate the range of ligatures available in Tamil. These changes take place after vowel reordering and vowel splitting. Unlike Devanagari, Tamil includes very few conjunct consonants; most ligatures are located between a vowel and a neighboring consonant. 1. Conjunct consonants. š + þ | + ¼ → a

As with Devanagari, vowel reordering occurs around conjunct consonants. For example: š + þ | + ¼ + Ëþ + à → ËaÃ

230 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.6 Tamil

2. The vowel à optionally ligates with ¨, ®, or ¶ on its left: ¨ + à → @ ® + à → A ¶ + à → B

Because this process takes place after reordering and splitting, the following lig- atures may also occur: Separate Vowels Precomposed Vowels ¨ + Ëþ + Ã → Ë@ ¨ +Ëþ Ã → Ë@ ¨ + Ìþ + Ã → Ì@ ¨ + ÌþÃ → Ì@ ® + Ëþ + Ã → ËA ® + ËþÃ → ËA ® + Ìþ + Ã → ÌA ® + ÌþÃ → ÌA ¶ + Ëþ+ Ã → ËB ¶ + ËþÃ → ËB ¶ + Ìþ + Ã → ÌB ¶ + ÌþÃ → ÌB

3. The vowel signs w and þ Å form ligatures with ¤ on their left. ¤ + w → C ¤ + þÅ→ D

These vowels often change shape or position slightly to link up with the appro- priate shape of the consonant on their left: · + w → í · + þ Å→ û

4. The vowel signs þý|Ü and þæ|Ý typically change form or ligate (see Table 9-7). Table 9-7. Ligating Vowel Signs

xx + þ ý |Ü x + þ æ |Ý xx + þ ý |Ü x + þ æ |Ý šEF ¯ k g ž ³PQ ŸGH ´lh £ I I× µ R S ¤JK ¶TT× ¨LL× · U U× ©MM׸VW

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 231 9.6 Tamil South and Southeast Asian Scripts

Table 9-7. Ligating Vowel Signs (Continued)

xx + þ ý |Ü x + þ æ |Ý xx + þ ý |Ü x + þ æ |Ý ­NN×¹XWX ®OO× º

To the right of ¡, ¼, ½, ¾, or a, these forms have a spacing form (see Figure 9-9). Figure 9-9. Spacing Forms of Vowels ¡ + þý → ¡Ü ¡ + þ æ → ¡Ý

5. The vowel sign Íþ changes to Î þto the left of ¨, ®, ·, or ¸. Íþ + ¨ → Ψ Íþ + ® → ή Íþ + · → η Íþ + ¸ → θ

Remember that this change takes place after the vowel reordering. In the first example, the vowel Íþ follows ¨ in the memory representation. After vowel reordering, it is on the left of ¨, and thus changes form. The com- plete process is ¨ + Íþ → Íþ + ¨ → Ψ

6. The consonant µ changes shape to Ã. This change occurs when the à form of µ U+0BB0    would not be confused with the nominal form à of U+0BBE     (for example, when – is combined withþ,| w , orþ Å ). µ + þ | → à + þ | µ + w → à + w µ + þÅ → à + þÅ

232 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.7 Telugu

9.7 Telugu

Telugu: U+0C00–U+0C7F The is a South Indian script used to write the of Andhra state in India, as well as minority languages such as Gondi (Adilabad and Koi dia- lects) and . Rendering Behavior. For rendering of the Telugu script, see the rules for rendering in Section 9.6, Tamil. Take note that, unlike Tamil, the Telugu script writes conjunct conso- nants with subscripted letters. There are also numerous consonant letters with contextual shape changes when used in conjuncts. Some vowel signs also change their shape in speci- fied combinations. Special Characters. U+0C55    is provided as an encoding for the sec- ond element of the vowel U+0C47    . U+0C56     is provided as an encoding for the second element of the surroundrant vowel U+0C48    . The length marks are both nonspacing characters.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 233 9.8 Kannada South and Southeast Asian Scripts

9.8 Kannada

Kannada: U+0C80–U+0CFF The is a South Indian script used to write the Kannada (or Kanarese) lan- guage of state, as well as minority languages such as Tulu. It is very closely related to the Telugu script both with regard to the shapes of the letters and in the way that conjunct consonants behave. Special Characters. U+0CD5    is provided as an encoding for the right side of the two-part vowel U+0CC7    . U+0CD6     is provided as an encoding for the right side of the two-part vowel U+0CC8    . The Kannada two-part vowels actually consist of a nonspacing element above the consonant letter and one or more spacing letters to the right of the con- sonant letter. Kannada Letter LLLA. U+0CDE    is actually an obsolete Kannada let- ter that is transliterated in Dravidian scholarship as an “r” marked with the diacritic U+0324   . This form ought to have been given the letter name “”, rather than “”, so the Unicode name is simply a mistake. The letter in question has not been actively used in Kannada since the end of the tenth century. Collations should treat U+0CDE as following U+0CB3   .

234 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.9 Malayalam

9.9 Malayalam

Malayalam: U+0D00–U+0D7F The is a South Indian script used to write the Malayalam language of Ker- state. The shapes of Malayalam letters closely resemble those of Tamil. Malayalam, however, has a very full and complex set of conjunct consonant forms. Special Characters. U+0D57     is provided as an encoding for the right side of the two-part vowel U+0D4C     .

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 235 9.10 Sinhala South and Southeast Asian Scripts

9.10 Sinhala

Sinhala: U+0D80–U+0DFF The Sinhala script, also known as Sinhalese, is used to write the , the majority language of Sri Lanka, formerly Ceylon. It is also used to write the and San- skrit languages. The script is a descendant of Brahmi and resembles the scripts of in form and structure. Sinhala differs from other languages of the region in that it has a series of prenasalized stops that are distinguished from the combination of a nasal followed by a stop. In other words, both forms occur and are written differently—for example: $%[U+0D85 U+0DAC] / a.Nda/ “sound” versus $&'( [U+0D85 U+0DAB U+0DCA U+0DA9] /a.n.da/ “egg”. In addition, Sinhala has separate distinct signs for both a short and long low front vowel sounding similar to the initial vowel of the English word “apple,” usually represented in IPA as U+00E6 æ (ash). The independent forms of these vowels are encoded at U+0D87 and U+0D88; the corresponding dependent forms are U+0DD0 and U+0DD1. Because of these extra letters, the encoding for Sinhala does not precisely follow the pattern established for the other Indic scripts (for example, Devanagari), but does use the same general structure, making use of phonetic order, matra reordering, and use of the virama (U+0DCA   -,) to indicate conjunct consonant clusters. Sinhala does not use half-forms in the Devanagari manner, but does use many ligatures. Other Letters for Tamil. The Sinhala script may also be used to write Tamil. In this case, some additional combinations may be required. Some letters, such as U+0DBB    and U+0DB1  , may be modified by adding the equiva- lent of a nukta. There is, however, no nukta presently encoded in the Sinhala block. Historical Symbols. Neither the nor the punctuation sign Kunddaliya (U+0DF4) is in general use today, having been replaced by Western digits and Western- style punctuation. The Kunddaliya was formerly used as a full stop; it is included for schol- arly use. The Sinhala numerals are not presently encoded.

236 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.11 Thai

9.11 Thai

Thai: U+0E00–U+0E7F The Thai script is used to write Thai and other Southeast Asian languages, such as Kuy, Lavna, and Pali. It is a member of the Indic family of scripts descended from Brahmi. Thai modifies the original Brahmi letter shapes and extends the number of letters to accommo- date features of the , including tone marks derived from superscript digits. On the other hand, Thai script lacks the conjunct consonant mechanism and independent vowel letters found in most other Bhrami-derived scripts. As in all scripts of this family, the predominant writing direction is left to right. The is closely related to Thai, and the encoding principles described in this sec- tion apply to the Lao encoding as well. Standards. Thai layout in the Unicode Standard is based on the Thai Industrial Standard 620-2529, and its updated version 620-2533. Encoding Principles. In common with the Indic scripts, each Thai letter is a consonant possessing an inherent vowel sound. Thai letters further feature inherent tones. The inher- ent vowel and tone can be modified by means of vowel signs and tone marks attached to the base consonant letter. Some of the vowel signs and all of the tone marks are rendered in the script as diacritics attached above or below the base consonant. These combining signs and marks are encoded after the modified consonant in the memory representation. Most of the Thai vowel signs are rendered by full letter-sized in-line glyphs placed either before (that is, to the left of) or after (to the right of) or around (on both sides of) the glyph for the base consonant letter. In the Thai encoding, the letter-sized glyphs that are placed before (left of) the base consonant letter, in full or partial representation of a vowel sign, are in fact encoded as separate characters that are typed and stored before the base consonant character. This encoding for left-side Thai vowel sign glyphs (and similarly in Lao) differs from the conventions for all other Indic scripts, which uniformly encode all vowels after the base consonant. The difference is necessitated by the encoding practice commonly employed with Thai character data as represented by the Thai Industrial Stan- dard. Thai Punctuation. Thai uses a variety of punctuation marks particular to this script. U+0E4F    is the Thai bullet, used to mark items in lists, or appearing at the beginning of a verse, sentence, paragraph, or other textual segment. U+0E46    is used to mark repetition of preceding letters. U+0E2F    is used to indicate ellision or abbreviation of letters; it is itself viewed as a kind of letter, however, and is used with considerable frequency because of its appearance in such words as the Thai name for Bangkok. Paiyannoi is also used in combination (U+0E2F U+0E25 U+0E2F) to create a construct called paiyanyai, which means “, and so forth.” The Thai paiyanyai is comparable to the isomorphic analogue in the : U+17D8   . U+0E5A    is used to mark the end of a long segment of text. It can be combined with a following U+0E30     to mark a larger seg- ment of text; typically this usage can be seen at the end of a verse in poetry. U+0E5B    marks the end of a chapter or document, where it always follows the angkhankhu + sara a combination. The Thai angkhankhu and its combination with sara a to mark breaks in text have analogues in many other Brahmi-derived scripts. For example, they are closely related to U+17D4    and U+17D5  

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 237 9.11 Thai South and Southeast Asian Scripts

, which are themselves ultimately related to the danda and double danda of Devanagari. Thai words are not separated by spaces. Text is laid out with spaces introduced at text seg- ments where Western typography would typically make use of commas or periods. How- ever, Latin-based punctuation such as comma, period, and colon are also used in text, particularly in conjunction with Latin letters, or in formatting numbers, addresses, and so forth. If word boundary indications are desired—for example, for the use of automatic line layout algorithms—the character U+200B    should be used to place invisible marks for such breaks. The    can grow to have a visible width when justified. Thai Transcription of Pali and Sanskrit. The Thai script is frequently used to write Pali and Sanskrit. When so used, consonant clusters are represented by the explicit use of U+0E3A    (virama) to mark the removal of the inherent vowel. There is no conjoining behavior, unlike other Indic scripts. U+0E4D    is the Pali nigghahita and Sanskrit anusvara. U+0E30     is the Sanskrit visarga. U+0E24    and U+0E26    are vocalic /r/ and /l/, with U+0E45    used to indicate their lengthening.

238 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.12 Lao

9.12 Lao

Lao: U+0E80–U+0EFF The and script are closely related to Thai. The Unicode Standard encodes the Lao script in the same relative order as Thai. A few additional letters in Lao have no match in Thai: U+0EBB      U+0EBC     U+0EBD     The preceding two semivowel signs are the last remnants of the system of subscript medials, which in Myanmar also includes original “rw”. Myanmar and Khmer include a full set of subscript consonant forms used for conjuncts. Thai no longer uses any of these forms; Lao has just the two. There are also two ligatures in the Unicode character encoding for Lao: U+0EDC    and U+0EDD   . They correspond to sequences of [h] plus [n] or [h] plus [m] without ligating. Their function in Lao is to provide versions of the [n] and [m] consonants with a different inherent tonal implication.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 239 9.13 Tibetan South and Southeast Asian Scripts

9.13 Tibetan

Tibetan: U+0F00–U+0FBF The Tibetan script is used for writing Tibetan in several countries and regions throughout the Himalayas. Aside from itself, the script is used in Ladakh, Nepal, and northern areas of India bordering Tibet where large Tibetan-speaking populations now reside. The Tibetan script is also used in Bhutan to write Dzongkha, the official language of that coun- try. Tibetan is also used as the language of philosophy and liturgy by Buddhist traditions spread from Tibet into the Mongolian cultural area that encompasses , Buriatia, Kalmykia, and Tuva. The Tibetan scripting and grammatical systems were originally defined together in the sixth century by royal decree when the Tibetan King Songtsen Gampo sent 16 men to India to study Indian languages. One of those men, Thumi Sambhota, is credited with creating the Tibetan writing system upon his return, having studied various Indic scripts and gram- mars. The king’s primary purpose was to bring Buddhism from India to Tibet. The new script system was therefore designed with compatibility extensions for Indic (principally Sanskrit) transliteration so that Buddhist texts could be properly represented. Because of this origin, over the last 1,500 years the Tibetan script has also been widely used to repre- sent Indic words, a number of which have been adopted into the Tibetan language retain- ing their original spelling. A note on Latin transliteration: Tibetan spelling is traditional, and does not generally reflect modern pronunciation. Throughout this section, Tibetan words are represented in italics when transcribed as spoken, followed at first occurrence by a parenthetical translit- eration; in these transliterations, presence of the tsek (tsheg) character is expressed with a hyphen. Thumi Sambhota’s original grammar treatise defined two script styles. The first, called uchen (dbu-can, “with head”), is a formal “inscriptional capitals” style said to be based on an old form of Devanagri. It is the script used in Tibetan xylograph books and the one used in the coding tables. The second style, called u-mey (dbu-med, or “headless”), is more cur- sive and said to be based on the Wartu script. Numerous styles of u-mey have evolved since then, including both formal calligraphic styles used in manuscripts and running handwrit- ing styles. All Tibetan scripts follow the same lettering rules, though there is a slight differ- ence in the way that certain compound stacks are formed in uchen and u-mey. General Principles of the Tibetan Script. Tibetan grammar divides letters into consonants and vowels. There are 30 consonants, and each consonant is represented by a discrete writ- ten character. There are five vowel sounds, only four of which are represented by written marks. The four vowels that are explicitly represented in writing are each represented with a single mark that is applied above or below a consonant to indicate the application of that vowel to that consonant. The absence of one of the four marks implies that the first vowel sound (like a short “ah” in English) is present and is not modified to one of the four other possibilities. Three of the four marks are written above the consonants; one is written below. Each word in Tibetan has a base or root consonant. The base consonant can be written sin- gly or it can have other consonants added above or below it to make a vertically “stacked” letter. Tibetan grammar contains a very complete set of rules regarding letter gender, and these rules dictate which letters can be written in adjacent positions. The rules therefore dictate which combinations of consonants can be joined to make stacks. Any combination not allowed by the gender rules does not occur in native Tibetan words. However, when

240 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan transcribing other languages (for example, Sanskrit, Chinese) into Tibetan, these rules do not operate. In certain instances, other than transliteration, any consonant may be com- bined with any other subjoined consonant. Implementations should therefore be prepared to accept and display any combinations. The model adopted to encode the Tibetan lettering set described above contains the follow- ing groups of items: Tibetan consonants, vowels, numerals, punctuation, ornamental signs and marks, and Tibetan-transliterated Sanskrit consonants and vowels. Each of these will be described below. Both in this description and in Tibetan, the terms “subjoined” (-btags) and “head” (-mgo) are used in different senses. In the structural sense, they indicate specific slots defined in native Tibetan orthography. In spatial terms, they refer to the position in the stack; any- thing in topmost position is “head,” anything not in topmost position is “subjoined.” Unless explicitly qualified, the terms “subjoined” and “head” are used here in their spatial sense. For example, in a conjunct like “rka” the letter in the root slot is “KA”, but, because it is not the topmost letter of the stack, it is expressed with a subjoined character code, while “RA”, which is structurally in the head slot, is expressed with a nominal character code. On the other hand, in a conjunct “” in which the root slot is also occupied with “KA”, the “KA” is encoded with a nominal character code, because it is in the topmost position in the stack. The Tibetan script has its own system of formatting and details of that system relevant to the characters encoded in this standard are explained herein. However, an increasing num- ber of publications in Tibetan do not strictly adhere to this original formatting system. This change is due to the partial move from publishing on long, horizontal, loose-leaf , to publishing in vertically oriented, bound books. The Tibetan script also has a punctuation set designed to meet needs quite different from the punctuation that has evolved for West- ern scripts. With the appearance of Tibetan newspapers, magazines, school textbooks, and Western style reference books in the last 20 or 30 years, Tibetans have begun using things like columns, indented blocks of text, Western-style headings, and footnotes. Some West- ern punctuation marks, including brackets, parentheses, and quotation marks, are also becoming commonplace in these kinds of publication. With the introduction of more sophisticated electronic publishing systems, there is also a renaissance in the publication of voluminous religious and philosophical works in the traditional horizontal, loose-leaf for- mat—many set in digital closely conforming to the proportions of traditional hand-lettered text. Consonants. The system devised to encode the Tibetan system of writing consonants in both single and stacked forms as described above is as follows: All of the consonants are encoded a first time from U+0F40 through U+0F69. There are the basic Tibetan consonants and, in addition, six compound consonants used to represent the Indic consonants gha, , d.ha, dha, bha, and ksh.a. These codes are used to represent occurrences of either a stand-alone consonant or a consonant in the head position of a ver- tical stack. Glyphs generated from these codes will always sit in the normal position starting at and dropping down from the design baseline. All of the consonants are then encoded a second time. These second encodings from U+0F90 through U+0FB9 represent conso- nants in subjoined stack position. To represent a single consonant in a text stream, one of the first, “nominal,” set of codes is placed. To represent a stack of consonants in the text stream, a “nominal” consonant code is followed directly by one or more of the subjoined consonant codes. The stack so formed continues for as long as subjoined consonant codes are contiguously placed. This encoding method was chosen over an alternative method that would have involved a virama-based encoding, like Devanagari. There were two main reasons for this choice.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 241 9.13 Tibetan South and Southeast Asian Scripts

First, the virama is not normally used in the Tibetan writing system to create letter combi- nations. (There is a virama in the Tibetan script, but only because of the need to represent Devanagari; it is called “srog-med” and encoded at U+0F84. The virama is never used in writing Tibetan words and can be, but is almost never, used as a substitute for stacking in writing Sanskrit mantras in the Tibetan script.) Second, there is a prevalence of stacking in native Tibetan, and the model chosen specifically results in decreased data storage require- ments. Furthermore, in languages other than Tibetan, there are many cases where stacks occur that do not occur in Tibetan-language texts; it is thus imperative to have a model that allows for any consonant to be stacked with any subjoined consonant(s). Thus a model for stack building was chosen that follows the Tibetan approach to creating letter combina- tions, but is not limited to a specific set of the possible combinations. Vowels. Each of the four basic Tibetan vowel marks mentioned above is coded as a separate entity. They are U+0F72, U+0F74, U+0F7A, and U+0F7C. For compatibility, a set of sev- eral compound vowels for Sanskrit transcription is also provided in the other codepoints between U+0F71 and U+0F7D. Most Tibetan users do not view these compound vowels as single characters, and their use is limited to Sanskrit words. It is acceptable for users to enter these compounds as a series of simpler elements and have software render them appropriately. Canonical equivalences are specified for all except U+0F77 and U+0F79. All vowel signs are nonspacing marks above or below a stack of consonants, sometimes on both sides. A stand-alone consonant or a stack of consonants can have a vowel sign applied to it. In accordance with the rules of Tibetan writing, a code for a vowel sign applied to a consonant should always be placed after the bare consonant or the stack of consonants formed by the method just described. All of the symbols and punctuation marks have straightforward encodings. Further infor- mation about many of them is included below. Coding Order. In general, the correct coding order for a stream of text will be the same as the order in which Tibetans spell and in which the characters of the text would be written by hand. For example, the correct coding order for the most complex Tibetan stack would be: head position consonant first subjoined consonant ... (intermediate subjoined consonants if any) last subjoined consonant subjoined vowel a-chung (U+0F71) standard or compound vowel sign, or virama Where used, the character U+0F39    - occurs immediately after the consonant it modifies. Allographical Considerations. When consonants are combined to form a stack, one of them retains the status of being the principal consonant in the stack. The principal conso- nant always retains its stand-alone form. However, consonants placed in the “head” and “subjoined” positions to the main consonant sometimes retain their stand-alone form and sometimes are given a new, special form. Because of this fact, certain of the consonants are given a further, special encoding treatment. The affected consonants are “wa” (U+0F5D), “ya” (U+0F61), and “ra” (U+0F62). Head Position “ra”. When the consonant “ra” is written in the “head” position (ra-mgo) at the top of a stack in the normal Tibetan-defined lettering set, the shape of the consonant can change. It can either be a full-form shape or the full-form shape but with the bottom stroke removed (looking like a short-stemmed letter “T”). This requirement of “ra” in the

242 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan head position where the glyph representing it can change shape is correctly coded by using the stand-alone “ra” consonant (U+0F62) followed by the appropriate subjoined conso- nant(s). For example, in the normal Tibetan ra-mgo combinations, the “ra” in the head position is mostly written as the half-ra but in the case of “ra + sub-joined nya” must be written as the full-form “ra”. Thus the normal Tibetan ra-mgo combinations are correctly encoded with the normal “ra” consonant (U+0F62) because it can change shape as required. It is the responsibility of the font developer to provide the correct glyphs for rep- resenting the characters where the “ra” in the head position will change shape—for exam- ple, as in “ra + subjoined nya”. Full-Form “ra” in Head Position. Some instances of “ra” in the head position require that the consonant be represented as a full-form “ra” that never changes. This is not standard usage for the Tibetan language itself, but occurs in transliteration and transcription. In these cases, only the code U+0F6A    - should be used. This “ra” will always be represented as a full-form “ra consonant” and will never change shape to the form where the lower stroke has been cut off. For example, the letter combination “ra + ya” when appearing in transliterated Sanskrit works is correctly written with a full- form “ra” followed by either a modified subjoined “ya” form or a full-form subjoined “ya” form. Note that the fixed-form “ra” should be used only in combinations where “ra” would normally transform into a short form but the user specifically wants to prevent that change. For example, the combination “ra + subjoined nya” never requires the use of fixed- form “ra”, because “ra” normally retains its full glyph form over “nya”. It is the responsibil- ity of the font developer to provide the appropriate glyphs to represent the encodings. Subjoined Position “wa”, “ya”, and “ra”. All three of these consonants can be written in subjoined position to the main consonant according to normal Tibetan grammar. In this position, all of them change to a new shape. The “wa” consonant when written in sub- joined position is not a full “wa” letter any longer but is literally the bottom-right corner of the “wa” letter cut off and appended below. For that reason it is called a wazur (wa-zur, or “corner of a wa”) or less frequently, but just as validly, wa-ta (wa-btags) to indicate that it is a subjoined “wa”. The consonants “ya” and “ra” when in the subjoined position are called ya-ta (ya-btags) and ra-ta (ra-btags), respectively. To encode these subjoined consonants that follow the rules of normal Tibetan grammar, the shape-changed, subjoined forms U+0F5D   , U+0F61   , and U+0F62    should be used. All three of these subjoined consonants also have full-form non-shape-changing counter- parts for the needs of transliterated and transcribed text. For this purpose, the full sub- joined consonants that do not change shape (encoded at U+0FBA, U+0FBB, and U+0FBC, respectively) are used where necessary. The combinations of “ra + ya” are a good example because they include instances of “ra” taking a short (ya-btags) form and “ra” taking a full- form subjoined “ya”. U+0FB0    - (a-chung) should be used only in the very rare cases where a full-sized subjoined a-chung letter is required. The small vowel lengthening a-chung encoded as U+0F71     is far more frequently used in Tibetan text, and it is therefore recommended that implementations treat this character rather than U+0FB0 as the normal subjoined a-chung. Line-Breaking Considerations. Tibetan text separates units called natively “tsheg-bar”, an inexact translation of which is “syllable.” Tsheg-bar is literally the unit of text between tseks, and is generally a consonant cluster with all of its prefixes, suffixes, and vowel signs. It is not a “syllable” in the English sense. Tibetan script has two break characters only. The primary break character is the standard interword tsek (“tsheg”), which is encoded at U+0F0B. The other break character is the space. Space or tsek characters in a stream of Tibetan text are not always break characters

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 243 9.13 Tibetan South and Southeast Asian Scripts and so need proper contextual handling. Issues surrounding these two potential break characters will now be discussed. The primary delimiter character in Tibetan text is the tsek (U+0F0B   -  ). In general, automatic line-breaking processes may break after any occur- rence of this tsek, except where it follows a U+0F44    (with or without a vowel sign) and precedes a shay (U+0F0D), or where Tibetan grammatical rules do not permit a break. (Normally, tsek is not written before shay except after “nga”. This type of tsek-after-nga is called “nga-phye-tsheg”, and may be expressed by U+0F0B, or by the spe- cial character U+0F0C, a nonbreaking form of tsek.) The Unicode names for these two types of tsek are misnomers, retained for compatibility. The standard tsek U+0F0B     is always required to be a potentially breaking char- acter, whereas the “nga-phye-tsheg” is always required to be a nonbreaking tsek. U+0F0C      is specifically not a “delimiter” and is not for gen- eral use. There are no other break characters in Tibetan text. Unlike English, Tibetan has no system for hyphenating or otherwise breaking a word within the group of letters making up the word. Tibetan text formatting does not allow text to be broken within a word. Whitespace appears in Tibetan text, although it should be represented by U+00A0 -   instead of U+0020 . Tibetan text breaks lines after tsek instead of at whitespace. Complete Tibetan text formatting is best handled by a formatter in the application and not just by the code stream. If the interword and nonbreaking tseks are properly employed as breaking and nonbreaking characters, respectively, and if all spaces are nonbreaking spaces, then any application will still wrap lines correctly on that basis, even though the breaks might be sometimes inelegant. Tibetan Punctuation. The punctuation apparatus of Tibetan is relatively limited. The principle punctuation characters are the tsek already mentioned, the shay (transliterated “shad”), which is a vertical stroke used to mark the end of a section of text, the space used sparingly as a space, and two of several variant forms of the shay that are used in specialized situations requiring a shay. There are also several other marks and signs but they are spar- ingly used. The shay at U+0F0D marks the end of a piece of text called “tshig-grub”. The mode of marking bears no commonality with English phrases or sentences and should not be described as a delimiter of phrases. In Tibetan grammatical terms, a shay is used to mark the end of an expression (“brjod.pa”) and a complete expression. Two shays are used at the end of whole topics (“don.tshan”). Because some writers use the double shay with a differ- ent spacing than would be obtained by coding two adjacent occurrences of U+0F0D, the double shay has been coded at U+0F0E with the intent that it would have a larger spacing between component shays than if two shays were simply written together. However, most writers do not use an unusual spacing between the double shay, so the application should allow the user to write two U+0F0D codes one after the other. Additionally, font designers will have to decide whether to implement these shays with a larger than normal gap. The U+0F11 rin-chen-pung-shay (rin-chen-spung-shad) is a variant shay used in a specific “new-line” situation. Its use was not defined in the original grammars but Tibetan tradi- tion gives it a highly defined use. The drul-shay (“sbrul-shad”) is likewise not defined by the original grammars but Tibetan tradition gives it a highly defined use. It is used for sep- arating sections of meaning that are equivalent to topics (“don.tshan”) and subtopics. A drul-shay is usually surrounded on both sides by the equivalent of about three spaces (though there is no rule specified). Hard spaces will be needed for these because the

244 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan drul-shay should not appear at the beginning of a new line and the whole structure of spac- ing-plus-shay should not be broken up, if possible. Tibetan texts use a yig-go (“head mark,” yig-mgo) to indicate the beginning of the front of a , there being no other certain way, in the loose-leaf style of traditional Tibetan books, to tell which is the front of a page. The head mark can and does vary from text to text; there are many different ways to write it. The common type of head mark has been provided for with U+0F04        and its extension U+0F05     . An initial mark yig-go can be written alone or combined with as many as three closing marks following it. When the initial mark is written in com- bination with one or more closing marks, the individual parts of the whole must stay in proper registration with each other to appear authentic. Therefore, it is strongly recom- mended that font developers create precomposed ligature glyphs to represent the various combinations of these two characters. The less common head marks mainly appear in Nyingmapa and Bonpo literature. Three of these head marks have been provided for with U+0F01, U+0F02, and U+0F03; however, many others have not been encoded. Font devel- opers will have to deal with the fact that many types of head marks in use in this literature have not been encoded, cannot be represented by a replacement that has been encoded, and will be required by some users. Two characters, U+0F3C      and U+0F3D     , are paired punctuation, typically used together forming a roof over one or more digits or words. In this case, kerning or special ligatures may be required for proper rendering. The right ang khang may also be used much as a single closing parenthesis is used in forming lists; again, special kerning may be required for proper rendering. The marks U+0F3E     and U+0F3F     are paired signs used to combine with digits; special glyphs or compositional metrics are required for their use. A set of frequently occurring astrological and religious signs specific to Tibetan is encoded between U+0FBE and U+0FCF. U+0F34 means “et cetera” or “and so on” and is used after the first few tsek-bar of a recur- ring phrase. U+0FBE (often three times) indicates a refrain. U+0F36 and U+0FBF are used to indicate where text should be inserted within other text or as references to footnotes or marginal notes. Other Characters. The Wheel of Dharma, which occurs sometimes in Tibetan texts, is encoded in the Miscellaneous Symbols block at U+2638. Left-facing and right-facing swastika symbols are likewise used. They are found among the Chinese ideographs at U+534D (“yung-drung-chi-khor”) and U+5350 (“yung-drung- nang-khor”). The marks U+0F35       and U+0F37       conceptually attach to a tsek-bar rather than to an individual character and function more like attributes than characters—like underlining to mark or emphasize text. In Tibetan interspersed commentaries, they may be used to tag the tsek-bar belonging to the root text that is being commented on. The same thing is often accomplished by set- ting the tsek-bar belonging to the root text in large type and the commentary in small type. Correct placement of these glyphs may be problematic. If they are treated as normal com- bining marks, they can be entered into the text following the vowel signs in a stack; if used, their presence will need to be accounted for by searching algorithms, and so forth. Tibetan Half-Numbers. The half-number forms (U+0F2A..U+0F33) are peculiar to Tibetan, though other scripts (for example, Bengali) have similar fractional concepts. The value of each half-number is 0.5 less than the number within which it appears. These forms

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 245 9.13 Tibetan South and Southeast Asian Scripts are used only in some traditional contexts and appear as the last digit of a multidigit num- ber. The sequence of digits “U+0F24 U+0F2C” represents the number 42.5 or forty-two and one-half. Tibetan Transliteration and Transcription of Other Languages. Tibetan traditions are in place for transliterating other languages. Most commonly, Sanskrit has been the language being transliterated, though Chinese has become more common in modern times. Addi- tionally, Mongolian has a transliterated form. There are even some conventions for translit- erating English. One feature of Tibetan script/grammar is that it allows for totally accurate transliteration of Sanskrit. The basic Tibetan letterforms and punctuation marks contain most of what is needed, although a few extra things are required. With these additions, Sanskrit can be transliterated perfectly into Tibetan, and the Tibetan transliteration can be rendered backward perfectly into Sanskrit with no ambiguities or difficulties. The six Sanskrit retroflex letters are interleaved among the other consonants. The compound Sanskrit consonants are not included in normal Tibetan. They could be made using the method described earlier for Tibetan stacked consonants, generally by sub- joining “ha”. However, to maintain consistency in transliterated texts and for ease in trans- mission and searching, it is recommended that implementations of Sanskrit in the Tibetan script use the precomposed forms of aspirated letters (and U+0F69 “ka + reversed sha”) whenever possible, rather than implementing these consonants as completely decomposed stacks. Note that implementations must ensure that decomposed stacks and precomposed forms are interpreted equivalently (see Section 3.6, Decomposition). The compound conso- nants are explicitly coded as follows: U+0F93    , U+0F9D    , U+0FA2    , U+0FA7    , U+0FAC    , and U+0FB9    . The vowel signs of Sanskrit not included in Tibetan are encoded with other vowel signs between U+0F70 and U+0F7D. U+0F7F     (nam chay) is the visarga, and U+0F7E       (ngaro) is the anusvara. See Section 9.1, Devanagari, for more information on these two characters. The characters encoded in the range U+0F88..U+0F8B are used in transliterated text and are most commonly found in Kalachakra literature. When the Tibetan script is used to transliterate Sanskrit, consonants are sometimes stacked in ways that are not allowed in native Tibetan stacks. Even complex forms of this stacking behavior are catered for properly by the method described earlier for coding Tibetan stacks. Other Signs. U+0F09      is a list enumerator used at the start of administrative letters in Bhutan, as is the petition honorific U+0F0A   -   . U+0F3A      and U+0F3B      are paired punctuation marks (brackets). The sign U+0F39    - (tsa-’phru, which is a lenition mark) is the ornamental flaglike mark that is an integral part of the three consonants U+0F59   , U+0F5A   , and U+0F5B   . Although those consonants are not decomposable, this mark has been abstracted and may by itself be applied to “pha” and other consonants to make new letters for use in transliteration and transcription of other languages. For example, in modern literary Tibetan, it is one of the ways used to transcribe the Chinese “fa” and “va” sounds not represented by the normal Tibetan consonants. Tsa-’phru is also used to represent tsa, tsha, or dza in abbreviations. Traditional Text Formatting and Line Justification. Native Tibetan texts (“pecha”) are written and printed using a justification system that is, strictly speaking, right-ragged but

246 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan with an attempt to right justify. Each page has a margin. That margin is usually demarcated with visible border lines required of a pecha. In modern times, as Tibetan text is produced in Western-style books, the margin lines may be dropped and an invisible margin used. When writing the text within the margins, an attempt is made to have the lines of text jus- tified up to the right margin. To do so, writers keep an eye on the overall line length as they fill lines with text and try manually to justify to the right margin. Even then, there is often a gap at the right margin that cannot be filled. If the gap is short, it will be left as is and the line will be said to be justified enough, even though by machine-justification standards the line is not truly flush on the right. If the gap is large, the intervening space will be filled with as many tseks as are required to justify the line. Again, the justification is not done perfectly in the way that English text might be perfectly right-justified; as long as the last tsek is more or less at the right margin, that will do. The net result is that of a right-justified, blocklike look to the text, but the actual lines are always a little right-ragged. Justifying tseks are nearly always used to pad the end of a line when the preceding character is a tsek—in other words, when the end of a line arrives in the middle of tshig-grub (see the previous definition under “Tibetan Punctuation”). However, it is unusual for a line that ends at the end of tshig-grub to have justifying tseks added to the shay at the end of the tshig-grub. That is, xxxx| | is not usually padded like this (though it is allowable): xxxx| |..... In this case, instead of justifying the line with tseks, the space between shays is enlarged and/or the whitespace following the final shay is usually left as is. Padding is never applied following an actual space character. For example, given the existence of a space after a shay, the following may not be written with the padding, xxxxxxx| ...... because the final shay should have a space after it, and padding is never applied after spaces. The same applies where the final consonant of a tshig-grub that ends a line is a “ka” or “ga”. In that case, the ending shay is dropped but a space is still required after the consonant and that space must not be padded. For example, the following is not acceptable: xxxxxga ...... Tibetan text has two rules regarding the formatting of text at the beginning of a new line. There are severe constraints on which characters can start a new line, and the rule is tradi- tionally stated as follows: a shay of any description may never start a new line. Nothing but actual words of text can start a new line, the only exception being a go-yig (yig-mgo) at the head of a front page or a da- (zla-tshe, meaning “crescent moon”—for example, U+0F05) or one of its variations, which is effectively an “in-line” go-yig (yig-mgo), on any other line. One of two or three ornamental shays is also commonly used in short pieces of prose in place of the more formal da-tshe. This rule also means that a space may not start a new line in the flow of text. If there is a major break in a text, a new line might be indented. A syllable (tsheg-bar) that comes at the end of a tshig-grub and that starts a new line must have the shay that would normally follow it replaced by a rin-chen-spung-shad (U+0F11). (The reason for this rule is that the presence of the rin-chen-spung-shad makes the end of tshig-grub more visible and hence makes the text easier to read.) In verse, the second shay following the first rin-chen-spung-shad is also replaced some- times with a rin-chen-spung-shad, though the practice is formally incorrect. It is a writer’s trick done to make a particular scribing of a text more elegant. It is moderately popular device but does breaks the rule. Not only is rin-chen-spung-shad used as the replacement

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 247 9.13 Tibetan South and Southeast Asian Scripts for the shay but a whole class of “ornamental shays” are used for the same purpose. All are scribal variants on a rin-chen-spung-shad, which is correctly written with three dots above it. Tibetan Shorthand Abbreviations (bskungs-yig) and Limitations of the Encoding. A con- sonant functioning as the word-base (ming-gzhi) is allowed to take only one vowel sign according to Tibetan grammar. The Tibetan shorthand writing technique called bskungs- yig does allow one or more words to be contracted into a single, very unusual combination of consonants and vowels. This construction frequently entails the application of more than one vowel sign to a single consonant or stack, and the composition of the stacks them- selves can break the rules of normal Tibetan grammar. For this reason, vowel signs do sometimes interact typographically, which accounts for their particular combining classes (see Section 4.2, Combining Classes—Normative). The Unicode Standard accounts for plain text compounds of Tibetan that contain at most one base consonant, any number of subjoined consonants, followed by any number of vowel signs. This coverage constitues the vast majority of Tibetan text. Rarely, stacks are seen that contain more than one such consonant-vowel combination in a vertical arrange- ment. These stacks are highly unusual and are considered beyond the scope of plain text rendering. They may be handled by higher-level mechanisms.

248 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.14 Myanmar

9.14 Myanmar

Myanmar: U+1000–U+109F The Myanmar script is used to write Burmese, the majority language of Myanmar (for- merly called Burma). Variations and extensions of the script are used to write other lan- guages of the region, such as Shan and Mon, as well as Pali and Sanskrit. The Myanmar script was formerly known as the , but the term “Myanmar” is now pre- ferred. The Myanmar writing system derives from a Brahmi-related script borrowed from South India in about the eighth century for the . The first inscription in the Myan- mar script dates from the eleventh century, using an alphabet almost identical to that of the Mon inscriptions. Aside from rounding of the originally square characters, this script has remained largely unchanged to the present. It is said that the rounder forms were devel- oped to permit writing on palm leaves without tearing the writing surface of the leaf. Because of its Brahmi origins, the Myanmar script shares the structural features of its Indic relatives: consonant symbols include an inherent “a” vowel; various signs are attached to a consonant to indicate a different vowel; ligatures and conjuncts are used to indicate conso- nant clusters; and the overall writing direction is left to right. Thus, despite great differ- ences in appearance and detail, the Myanmar script follows the same basic principles as, for example, Devanagari. Standards. There is not yet an official national standard for the encoding of Myanmar/ Burmese. The current encoding was prepared with the consultation of experts from the Myanmar Information Technology Standardization Committee (MITSC) in Yangon (Rangoon). The MITSC, formed by the government in 1997, consists of experts from the Myanmar Computer Scientists’ Association, Myanmar Language Commission, and Myan- mar Historical Commission. Encoding Principles. As with Indic scripts, the Myanmar encoding represents only the basic underlying characters; multiple glyphs and rendering transformations are required to assemble the final visual form for each syllable. Even some single characters, such as U+102C    , may assume variant forms depending on the other characters with which they combine. Conversely, characters or combinations that may appear visually identical in some fonts, such as U+101D    and U+1040   , are distinguished by their underlying encoding. Composite Characters. As is the case in Extended Latin and many other scripts, some Myanmar letters or signs may be analyzed as composites of two or more other characters, and are not encoded separately. The following is an example of a Myanmar letter repre- sented by a combining character sequence: myanmar vowel sign o = U+1031     + U+102C     Encoding Subranges. The basic consonants, independent vowels, and dependent vowel signs required for writing the Myanmar language are encoded at the beginning of the Myanmar range. Extensions of each of these categories for use in writing other languages, such as Pali and Sanskrit, are appended at the end of the range. In between these two sets lie the script-specific signs, punctuation, and digits. Conjunct and Medial Consonants. As in other Indic-derived scripts, conjunction of two consonant letters is indicated by the insertion of a virama U+1039   

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 249 9.14 Myanmar South and Southeast Asian Scripts between them; it causes ligation or other rendered combination of the consonants, although the virama itself is not rendered visibly. The conjunct form of U+1004    is rendered as a superscript sign called kinzi. Kinzi is encoded in logical order as a conjunct consonant before the syllable to which the mark applies, such as the Devanagari ra. (See Section 9.1, Devanagari, Rule R2.) For example, kinzi applied to U+1000    would be written via the fol- lowing sequence: U+1004    U+1039    U+1000    The Myanmar script traditionally distinguishes a set of subscript “medial” consonants: forms of , , , and  that are considered to be modifiers of the syllable’s vowel. In the Myanmar encoding, the medial consonants are treated as conjuncts; that is, they are coded using the virama. So, for example, the word “kywe” (“to drop off”) would be written via the following sequence: U+1000    U+1039    U+101A    U+1039    U+101D    U+1031     Explicit Virama. The virama U+1039    also participates in some common constructions where it appears as a visible sign, commonly termed killer. In this usage where it appears as a visible diacritic, U+1039 is followed by a U+200C   -, as with Devanagari (see Figure 9-7). Signs After Consonants. Dependent vowels and other signs (except kinzi, as noted above) are encoded in logical order after the consonant to which they apply, regardless of where the glyph for the sign happens to be rendered relative to the glyph for the consonant. In particular, U+1031     is encoded after its consonant (as in the previ- ous example), although in visual presentation it is reordered to appear before (to the left of) the consonant form. Spacing. Myanmar does not use any whitespace between words. If word boundary indica- tions are desired—for example, for the use of automatic line layout algorithms—the char- acter U+200B    should be used to place invisible marks for such breaks. The    can grow to have a visible width when justified.

250 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.15 Khmer

9.15 Khmer

Khmer: U+1780–U+17FF Khmer, also known as Cambodian, is the official language of Kampuchea. Mutually intelli- gible dialects are also spoken in northeastern Thailand and the region of Vietnam. Although Khmer is not an Indo-European language, it has borrowed much vocabulary from Sanskrit and Pali, and religious texts in those languages have been translit- erated as well as translated into Khmer. The Khmer script, called a’saa kmae (“Khmer letters”), is descended from the Brahmi script of South India, as are Thai, Lao, Myanmar, Old Mon, and others. The exact sources have not been determined, but there is a great similarity between the earliest inscriptions in the region and the Pallawa script of the Coromandel coast of India. Modern Khmer has two basic styles of script: the a’saa criang (“slanted script”) and the a’saa muul (“round script”). There is no fundamental structural difference between the two. The slanted script (“stand- ing” variant) is chosen as representative in Chapter 14, Code Charts. Structurally, the Khmer script stays close to its southern Brahmi origins; in modern terms, it shares features with both Thai and Myanmar. Consonant letters bear an inherent vowel sound, with additional signs placed before, above, below, and after the consonants to indi- cate vowels other than the inherent one. Consonant clusters are represented by conjuncts, where the first consonant of the cluster maintains its full form and succeeding consonants are written as subscripts. The overall writing direction is left to right. The has a richer set of vowels than the languages for which the ancestral script was used, but it has a smaller set of consonant sounds. The Khmer script takes advantage of this situation by assigning different symbols to the same consonant but with different inherent vowels. This usage is analogous to the situation in Thai, where different consonant symbols have the same sound but encode different tones. The Khmer consonant letters are organized into two series or registers, whose inherent vowels are nominally “a” and “o”. Two “shifter” signs convert a consonant from one series to the other. The depen- dent vowel signs then do not have a single phonetic value, but rather are interpreted in the context of the consonant to which they are attached. Encoding Principles. As with other related scripts, the Khmer encoding represents only the basic underlying characters; multiple glyphs and rendering transformations are required to assemble the final visual form for each syllable. Even some single characters, such as U+1789   , may assume variant forms depending on the other characters with which they combine. Conversely, characters or combinations that may appear visually identical in some fonts, such as U+17A2    and U+17A3  -   , are distinguished by their underlying encoding. Independent Vowels. In Khmer, as in most related scripts (but in contrast to Thai), inde- pendent vowels have their own letterforms. Conjunct Consonants. U+17D2    plays the conjunct formation role of Indic virama, killing the vowel of its preceding consonant, and indicating that the following consonant should be treated as a subscript. Sign coeng should not be confused with U+17D1   , which has a name similar to virama but an unrelated func- tion—indicating that the base character is part of the previous word. Note that a subjoined U+179A    is typed normally—that is, after the pre- ceding consonant and sign coeng, even though the glyph for the subjoint letter ro is ren- dered before (to the left of) the full-size consonant. U+17CC   

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 251 9.15 Khmer South and Southeast Asian Scripts historically corresponds to the Devanagari repha (that is, an initial /r/), but it has lost this function in Khmer and instead is treated as a simple diacritical mark rather than as part of a conjunct. Consonant Shifters. The marks U+17C9    and U+17CA    are used to shift the base consonant between registers. In the presence of other superscript glyphs, both of these signs may be rendered via the same glyph shape as that of U+17BB    . Selection of the proper rendering form is left to the display software. Dependent Vowel Signs. Structurally, the Khmer dependent vowel signs are close to those of Thai, but the encoding principle is different. Khmer follows the general model for Brahmi-derived scripts, in which each dependent vowel sign is represented by a single code that occurs after the code for the base consonant. This principle reflects the fact that a vowel sign has its own identity, regardless of the number and placement of glyph fragments that may be used to render the sign, and regardless of its contextual phonetic interpreta- tion. Each Khmer consonant either expresses its implicit vowel or is followed by a single dependent vowel sign character. Ordering of Syllable Components. The standard order of components in consonantal syl- lables as expressed in BNF is C ( SC C )* { S } { V } { O } where C is a consonant SC is a sign coeng (= virama) S is a consonant shift V is a dependent vowel sign O is any other sign For example, the word /khnyom/, “I”, would represented by the following sequence: U+1781    U+17D2    U+1789    U+17B9     U+17C6    Except for the presence of consonant shifters, this structure is analogous to Devanagari (see Section 9.1, Devanagari). Special Character Usage. Although U+17A3    and U+17A2    have identical glyphs, only U+17A2 should be used for modern text. U+17A3 and U+17A4    are solely for Pali/Sanskrit transliteration. U+1789    may have two different subscript forms. Furthermore, the glyph part of the base letter U+1789, which normally occurs under the baseline, is omitted whenever a subscript is conjoined. U+17B1       bears a close relationship to U+17B2      . The very commonly used word meaning “give” should be spelled U+17B2 U+17D2 U+1799, with rendering software optionally replacing the glyph for U+17B1 with the glyph for U+17B2. U+17D3    is an exceedingly rare sign used in historic lunar dates; it should not be confused with U+17C6   , which carries

252 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.15 Khmer an “m” sound and occurs frequently in modern Khmer text. U+17C6   - , U+17C7   , and U+17C8    are signs that occur in many permutations with vowels; they are not vowels themselves. Spacing. Khmer does not use any whitespace between words. If word boundary indications are desired—for example, for the use of automatic line layout algorithms—the character U+200B    should be used to place invisible marks for such breaks. The    can grow to have a visible width when justified.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 253 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 10 East Asian Scripts 10

All of the writing systems of have as their root the ideographic script developed in China in the second millennium . Originally intended to write the various Chinese dia- lects, this script was exported throughout China’s sphere of influence and was used for cen- turies to write even non-Chinese languages such as Japanese, Korean, and Vietnamese. Speakers of other languages, such as Yi, created their own ideographs in imitation of Chinese. The is largely monosyllabic and noninflecting, and an ideographic writ- ing system suits it well. Ideographs are less well suited for unrelated languages. In Japan, this problem was solved by creating two syllabic scripts, hiragana and katakana. In Korea, an alphabetic system was invented, with letters grouped into ideograph-like syllabic blocks called hangul. Ideographs continued to be the exclusive way of writing Vietnamese until the twentieth century, when they were replaced with an alphabet based on the Latin script. They are still extensively used in Japanese, where they are called kanji, and somewhat more rarely in Korean, where they are called . In mainland China, the government has promoted the use of more modern, simplified forms of the ideographs over the older, more traditional forms used in Taiwan and overseas Chinese communities. Appendix A, Han Unification History, describes how the diverse typographic traditions of mainland China, Taiwan, Japan, Korea, and Vietnam have been reconciled to provide a common set of ideographs in the Unicode Standard for all of these languages and regions. The Unicode Standard includes a complete set of Korean hangul syllables, as well as the individual letters (jamos), which can be also be used to write Korean. Section 3.11, Conjoin- ing Jamo Behavior, describes how to use the conjoining jamos and how to convert between two methods for representing Korean. In all East Asian scripts, individual characters are written within uniformly sized squares. Diacritical marks are rarely used, although phonetic annotations are not uncommon. Tra- ditionally, East Asian scripts were written from the top of the page downward, with col- umns running right to left across the page. Under the influence of Western practices, a horizontal left-to-right writing sequence is now common. Many older character sets include characters intended to simplify the implementation of East Asian scripts, such as variant punctuation forms for text written vertically, halfwidth forms (which occupy only half a square), and fullwidth forms (which allow Latin letters to occupy a full square). These characters are included in the Unicode Standard for compati- bility with these older standards.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 257 10.1 Han East Asian Scripts

10.1 Han

CJK Unified Ideographs These blocks contain a set of Unified Han ideographic characters used in the written Chi- nese, Japanese, and Korean languages.1 The term Han, derived from the Chinese Han Dynasty, refers generally to Chinese traditional culture. The Han ideographic characters make up a coherent script, which was traditionally written vertically, with the vertical lines ordered from right to left. In modern usage, especially in technical works and in computer- rendered text, the Han script is written horizontally from left to right and is freely mixed with Latin or other scripts. When used in writing Japanese or Korean, the Han characters are interspersed with other scripts unique to those languages (Hiragana and Katakana for Japanese; Hangul syllables for Korean). The Han ideographic characters constitute a very large set, numbering in the tens of thou- sands. They have a long history of use in East Asia. Enormous compendia of Han ideo- graphic characters exist because of a continuous, millennia-long scholarly tradition of collecting all Han character citations, including variant, mistaken, and nonce forms, into annotated character dictionaries. Because of the large size of the Han ideographic character repertoire, and because of the particular problems the characters pose for standardizing their encoding, this character block description is more extended than that for other scripts and is divided into subsec- tions. The first three subsections (CJK Standards, Blocks, and Mapping to Standards) describe the character set standards used as sources, the way in which the Unicode Stan- dard divides ideographs into blocks, and some of the issues involved in mapping these characters to other character sets. These subsections are followed by an extended discus- sion of the characteristics of Han characters, with particular attention being paid to the problem of unification of encoding for characters used for different languages. Then there is a formal statement of the principles behind the Unified Han character encoding adopted in the Unicode Standard and order of their arrangement. For a detailed account of the background and history of development of the Unified Han character encoding, see also Appendix A, Han Unification History.

CJK Standards The Unicode Standard draws its Han character repertoire of 27,484 characters from a num- ber of character set standards. These standards are grouped into six sources, as indicated in Table 10-1. The primary work of unifying and ordering the characters from these sources was done by the Ideographic Rapporteur Group (IRG), a subgroup of ISO/IEC JTC1/SC2/ WG2. The G, T, J, K, and V sources represent the characters submitted to the IRG by its member bodies. The G source consists of submissions from mainland China, the Hong Kong SAR, and Singapore. The other four are the submissions from Taiwan, Japan, Korea, and Viet- nam, respectively. The U source represents character set standards that were not submitted to the IRG by any member body but which were used by the Unicode Consortium.

1. Although the term “CJK”—Chinese, Japanese, and Korean—is used throughout this text to describe the languages that currently use Han ideographic characters, it should be noted that earlier Vietnamese writing systems were based on Han ideographs. Conse- quently, the term “CJKV” would be more accurate in a historical sense. Han ideographs are still used for historical, religious, and pedagogical purposes in Vietnam.

258 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han

Table 10-1. Sources for Unified Han G source: G0 GB2312-80 G1 GB12345-90 with 58 Hong Kong and 92 Korean “Idu” characters G3 GB7589-87 unsimplified forms G5 GB7590-87 unsimplified forms G7 General Purpose Hanzi List for Modern Chinese Language, and General List of Simplified Hanzi GS Singapore Characters G8 GB8565-88 GE GB16500-95 T source: T1 CNS 11643-1992 1st plane T2 CNS 11643-1992 2nd plane T3 CNS 11643-1992 3rd plane with some additional characters T4 CNS 11643-1992 4th plane T5 CNS 11643-1992 5th plane T6 CNS 11643-1992 6th plane T7 CNS 11643-1992 7th plane TF CNS 11643-1992 15th plane J source: J0 JIS X 0208-1990 J1 JIS X 0212-1990 JA Unified Japanese IT Vendors Contemporary Ideographs, 1993 K source: K0 KS C 5601-1987 (unique ideographs) K1 KS C 5657-1991 K2 PKS C 5700-1 1994 K3 PKS C 5700-2 1994 V source: V0 TCVN 5773:1993 V1 TCVN 6056:1995 U source: KS C 5601-1987 (duplicate ideographs) ANSI Z39.64-1989 (EACC) Big-5 (Taiwan) CCCII, level 1 GB 12052-89 (Korean) JEF (Fujitsu) PRC Taiwan Telegraph Code (CCDC) Xerox Chinese Han Character Shapes Permitted for Personal Names (Japan) IBM Selected Japanese and Korean Ideographs

In some cases, the entire ideographic repertoire of the original character set standards was not included in the corresponding source. Three reasons explain this decision: 1. Where the repertoires of two of the character set standards within a single source have considerable overlap, the characters in the overlap might be included only once in the source. This approach is used, for example, with GB 2312-80 and GB 12345-90, which have many ideographs in common. Charac- ters in GB 12345-90 that are duplicates of characters in GB 2312-80 are not included in the G source. 2. Where a character set standard is based on unification rules that differ substan- tially from those used by the IRG, many variant characters found in the charac- ter set standard will not be included in the source. This situation is the case with CNS 11643-1992, EACC, and CCCII. It is the only case where full round- trip compatibility with the Han ideograph repertoire of the relevant character set standards is not guaranteed. 3. KS C 5601-1987 contains numerous duplicate ideographs included because they have multiple pronunciations in Korean. These multiply-encoded ideo- graphs are not included in the K source but are included in the U source to pro- vide full round-trip compatibility with KS C 5601-1987.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 259 10.1 Han East Asian Scripts

Blocks Ideographs are found in three blocks of the Unicode Standard: the CJK Unified Ideographs block (U+4E00–U+9FFF, common ideographs), the CJK Unified Ideographs Extension A block (U+3400–U+4DFF, rare ideographs), and the CJK Compatibility Ideographs block (U+F900–U+FAFF, duplicates, unifiable variants, and corporate characters). Characters in the CJK Unified Ideographs and CJK Unified Ideographs Extension A blocks are defined by the IRG and are derived entirely from the G, T, J, K, and V sources. The CJK Unified Ideographs block represents characters submitted to the IRG prior to 1992 and consists of commonly used characters. Characters in the CJK Unified Ideographs Extension A block are rarer, were submitted to the IRG between 1992 and 1998, and are not unifiable with characters in the CJK Unified Ideographs block. The only difference in the unification work done by the IRG on these two blocks is that the source separation rule was applied to the CJK Unified Ideographs block and not to the CJK Unified Ideographs Extension A block. This rule states that ideographs that are distinct in a source must not be unified. (For further discussion, see the subsection “Principles” later in this section.) Characters unique to the U source are found in the CJK Compatibility Ideographs block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining characters in the CJK Compatibility Ideographs block are either duplicates or unifiable variants of char- acters in the one of the other blocks and are included in the Unicode Standard for reasons of round-trip compatibility.

General Characteristics of Han Ideographs The authoritative Kouzien defines Han characters to be characters that originated among the Chinese to write the Chinese lan- guage. They are now used in China, Japan, and Korea. They are logo- graphic (each character represents a word, not just a sound) characters that developed from pictographic and ideographic principles. They are also used phonetically. In Japan they are generally called kanzi (Han, that is, Chinese, characters) including the “national characters” (kokuzi) such as touge (mountain pass), which have been created using the same prin- ciples. They are also called mana (true names, as opposed to kana, or borrowed names).1 For many centuries, was the accepted written standard throughout East Asia. The influence of the Chinese language and its written form on the modern East Asian languages is similar to the influence of Latin on the vocabulary and written forms of lan- guages in the West. This influence is immediately visible in the mixture of Han characters and native phonetic scripts (kana in Japan, hangul in Korea) as now used in the orthogra- phies of Japan and Korea (see Table 10-2). The evolution of character shapes and semantic drift over the centuries has resulted in changes to the original forms and meanings. For example, the Chinese character tang (Japanese tou or yu, Korean thang), which originally meant “hot water,” has come to mean “soup” in Chinese. “Hot water” remains the primary meaning in Japanese and Korean, whereas “soup” appears in more recent borrowings from Chinese, such as “soup noodles”

1. Lee Collins’ translation from the Japanese, Kouzien, Izuru, Shinmura, ed. (Tokyo: Iwan- ami Syoten, 1983).

260 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han

Table 10-2. Common Han Characters

English Han Character Chinesea Japanese Korean Translation tian1 ten, ame chen heaven,

di4 ti, tuti ci earth, ground

ren2 zin, hito in man, person

shan1 san, yama san mountain

shui3 sui, mizu swu water

shang4 zyou, ue sang above

xia4 ka, sita ha below

a. The superscripted numbers in this table represent Chinese (Mandarin) tone marks.

(Japanese tanmen; Korean thangmyen). Still, the identical appearance and similarities in meaning are dramatic and more than justify the concept of a unified Han script that tran- scends language. The “nationality” of the Han characters became an issue only when each country began to create coded character sets (for example, China’s GB 2312-80, Japan’s JIS X 0208-1978, and Korea’s KS C 5601-87) based on purely local needs. This problem appears to have arisen more from the priority placed on local requirements and lack of coordination with other countries, rather than out of conscious design. Nevertheless, the identity of the Han char- acters is fundamentally independent of language, as shown by dictionary definitions, vocabulary lists, and encoding standards. Terminology. Several standard of the term used to refer to East Asian ideo- graphic characters are commonly used. They include hanzi (Chinese), kanzi (Japanese), kanji (colloquial Japanese), hanja (Korean), and ChÔ hán (Vietnamese). The standard English for these terms are interchangeable: Han character, Han ideographic character, East Asian ideographic character, or CJK ideographic character. For the purpose of clarity, the Unicode Standard uses some subset of the English terms when referring to these characters. The term Kanzi is used in reference to a specific Japanese government publication. The unrelated term KangXi (which is a Chinese reign name, rather than another romanization of “Han character”) is used only when referring to the dictionary on which the Unified Repertoire and Ordering, Version 2.0, was based. Distinguishing Han Character Usage Between Languages. There is some concern that unifying the Han characters may lead to confusion because they are sometimes used differ- ently by the various East Asian languages. Computationally, Han character unification pre- sents no more difficulty than employing a single Latin character set that is used to write languages as different as English and French. Programmers do not expect the characters “c”, “h”, “a”, and “t” alone to tell us whether chat is a French word for cat or an English word meaning “informal talk.” Likewise, we depend on context to identify the American hood (of a car) with the British bonnet. Few computer users are confused by the fact that ASCII can also be used to represent such words as the Welsh word ynghyd, which are strange look- ing to English eyes. Although it would be convenient to identify words by language for pro- grams such as spell-checkers, it is neither practical nor productive to encode a separate Latin character set for every language that uses it.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 261 10.1 Han East Asian Scripts

Similarly, the Han characters are often combined to “spell” words whose meaning may not be evident from the constituent characters. For example, the two characters “to cut” and “hand” mean “postal stamp” in Japanese, but the compound may appear to be nonsense to a speaker of Chinese or Korean (see Figure 10-1). Figure 10-1. Han Spelling

+ 1. Japanese “stamp” = to cut hand 2. Chinese “cut hand”

Even within one language, a computer requires context to distinguish the meanings of words represented by coded characters. The word chuugoku in Japanese, for example, may refer to China or to a district in central west Honshuu (see Figure 10-2). Figure 10-2. Context for Characters

+ 1. China = middle country 2. Chuugoku district of Honshuu

Coding these two characters as four so as to capture this distinction would probably cause more confusion and still not provide a general solution. The Unicode Standard leaves the issues of language tagging and word recognition up to a higher level of software and does not attempt to encode the language of the Han characters. Sorting Han Ideographs. The Unicode Standard does not define a method by which ideo- graphic characters are sorted; the requirements for sorting differ by locale and application. Possible collating sequences include phonetic, radical-stroke (KangXi, Xinhua Zidian, and so on), four-corner, and total stroke count. Raw character codes alone are seldom sufficient to achieve a usable ordering in any of these schemes; ancillary data are usually required. (See Table 10-5.) Character Glyphs. In form, Han characters are monospaced. Every character takes the same vertical and horizontal space, regardless of how simple or complex its particular form is. This practice follows from the long history of printing and typographical practice in China, which traditionally placed each character in a square cell. When written vertically, there are also a number of named cursive styles for Han characters, but the cursive forms of the characters tend to be quite idiosyncratic and are not implemented in general-purpose Han character fonts for computers. There may be a wide variation in the glyphs used in different countries and for different applications. The most commonly used typefaces in one country may not be used in others. The types of glyphs used to depict characters in the Han ideographic repertoire of the Uni- code Standard have been constrained by available fonts. Users are advised to consult authoritative sources for the appropriate glyphs for individual markets and applications. It is assumed that most Unicode implementations will provide users with the ability to select the font (or mixture of fonts) that is most appropriate for a given locale.

Principles Three-Dimensional Conceptual Model. To develop the explicit rules for unification, a conceptual framework was developed to model the nature of Han ideographic characters. This model expresses written elements in terms of three primary attributes: semantic

262 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han

(meaning, function), abstract shape (general form), and actual shape (instantiated, type- face form). These attributes are graphically represented in three dimensions according to the X, Y, and Z axes (see Figure 10-3). Figure 10-3. Three-Dimensional Conceptual Model

€

„1 „2 ƒ ‚ † ‡

Y (abstract shape) Z (type face) 

X (semantic)

The semantic attribute (represented along the X axis) distinguishes characters by meaning and usage. Distinctions are made between entirely unrelated characters such as ! (marsh) and  (machine) as well as extensions or borrowings beyond the original semantic cluster    such as 1 (a phonetic borrowing used as a simplified form of ) and 2 (table, the orig- inal meaning). The abstract shape attribute (the Y axis) distinguishes the variant forms of a single charac- ter with a single semantic attribute (that is, a character with a single position on the X axis). The actual shape (typeface) attribute (the Z axis) is for differences of type design (the actual shape used in imaging) of each . Only characters that have the same abstract shape (that is, occupy a single point on the X and Y axes) are potential candidates for unification. Z axis typeface and stylistic differences are generally ignored. Unification Rules. The following rules were applied during the process of merging Han characters from the different source character sets: R1 Source Separation Rule. If two ideographs are distinct in a primary source stan- dard, then they are not unified. • This rule is sometimes called the round-trip rule because its goal is to facilitate a round-trip conversion of character data between an IRG source standard and the Unicode Standard without loss of information. • This rule was applied only for the work on the original Unified Repertoire and Ordering (URO). The IRG dropped this rule in 1992 and will not use it in future work. For example, the ideographs from the URO shown in Figure 10-4 would normally be sub- ject to unification by rule R3; however, their unification is prevented because they are dis- tinct in the primary source standard J0 (JIS X 0208-1990). R2 Noncognate Rule. In general, if two ideographs are unrelated in historical deriva- tion (noncognate characters), then they are not unified.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 263 10.1 Han East Asian Scripts

Figure 10-4. Preserving Variants

“sword”

For example, the following ideographs (in Figure 10-5), although visually quite similar, are nevertheless not unified because they are historically unrelated and have distinct meanings. Figure 10-5. Not Cognates, Not Unified ≠ earth warrior, scholar

R3 By means of a two-level classification (described next), the abstract shape of each ideograph is determined. Any two ideographs that possess the same abstract shape are then unified provided that their unification is not disallowed by either the source separation rule or the noncognate rule. Two-Level Classification. Using the three-dimensional model, characters are analyzed in a two-level classification. The two-level classification distinguishes characters by abstract shape (Y axis) and actual shape of a particular typeface (Z axis). Variant forms are identi- fied based on the difference of abstract shapes. To determine differences in abstract shape and actual shape, the structure and features of each component of an ideograph are analyzed as follows. Ideograph Component Structure. The component structure of each ideograph is exam- ined. A component is a geometrical combination of primitive elements. Various ideo- graphs can be configured with these components used in conjunction with other components. Some components can be combined to make a component more complicated in its structure. Therefore, an ideograph can be defined as a component tree with the entire ideograph as the root node and with the bottom nodes consisting of primitive elements (see Figure 10-6 and Figure 10-7). Figure 10-6. Component Structure

Figure 10-7. The Most Superior Node of a Component

vs.

vs.

vs.

264 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han

Ideograph Features. The following features of each ideograph to be compared are exam- ined: • Number of components • Relative position of components in each complete ideograph • Structure of a corresponding component • Treatment in a source character set • Radical contained in a component Uniqueness. If one or more of these features are different between the ideographs com- pared, the ideographs are considered to have different abstract shapes and therefore are considered unique characters and are not unified. Unification. If all of these features are identical between the ideographs, the ideographs are considered to have the same abstract shape and are therefore unified. The examples in Table 10-3 represent some typical differences in abstract character shape. The ideographs are therefore not unified. Table 10-3. Ideographs Not Unified

Characters Reason D Å E Different Number of Components F Å G Same Number of Components Placed in Different Relative Position H Å I Same Number and Same Relative Position of Components, Corresponding Components Structure Differently J Å K Characters Treated Differently in a Source Character Set L Å M Characters with Different Radical in a Component N Å O Same Abstract Shape, Difference in Actual Shape

Differences in actual shape of ideographs that have been unified are illustrated in Table 10-4. Table 10-4. Ideographs Unified

Characters Reason P ­ Q Different Writing Sequence T ­ U Differences in Overshoot at the Stroke Termination V ­ W Differences in Contact of Strokes X ­ Y Differences in Protrusion at the Folded Corner of Strokes Z ­ [ Differences in Bent Strokes \ ­ ] Differences in Stroke Termination 3 ­ 4 Differences in Accent at the Stroke Initiation a ­ 7 Difference in Rooftop Modification R ­ S Difference in Rotated Strokes/Dotsa a. These ideographs (having the same abstract shape) would have been unified except for the source separation rule.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 265 10.1 Han East Asian Scripts

Han Ideograph Arrangement. The arrangement of the Unicode Han characters is based on the position of characters as they are listed in four major dictionaries. The KangXi Zidian was chosen as primary because it contains most of the source characters and because the dictionary itself and the principles of character ordering it employs are commonly used throughout East Asia. The Han ideograph arrangement follows the index (page and position) of the dictionaries listed in Table 10-5 with their priorities. Table 10-5. Han Ideograph Arrangement

Priority Dictionary City Publisher Version 1 KangXi Zidian Beijing Zhonghua Bookstore, 1989 7th edition 2 Dai Kan-Wa Jiten Tokyo Taishuukan Shoten, 1986 Revised edition 3 Chengdu Sichuan Cishu Publishing, 1986 1st edition 4 Dae Jaweon Seoul Samseong Publishing Co. Ltd, 1988 1st edition

When a character is found in the KangXi Zidian, it follows the KangXi Zidian order. When it is not found in the KangXi Zidian and it is found in Dai Kan-Wa Jiten, it is given a posi- tion extrapolated from the KangXi position of the preceding character in Dai Kan-Wa Jiten. When it is not found in either KangXi or Dai Kan-Wa, then the Hanyu Da Zidian and Dae Jaweon dictionaries are consulted in a similar manner. Ideographs with simplified KangXi radicals are placed in a group following the traditional from which the simplified radical is derived. For example, characters with the simplified radical corresponding to KangXi radical follow the last nonsimplified character having as a radical. The arrangement for these simplified characters is that of the Hanyu Da Zidian. The few characters that are not found in any of the four dictionaries are placed following characters with the same KangXi radical and stroke count. The radical-stroke order that results is a culturally neutral order. It does not exactly match the order found in common dictionaries. Information for sorting all CJK ideographs by the radical-stroke method is found on the CD-ROM. It should be used if characters from the CJK Unified Ideographs and CJK Unified Ideographs Extension A blocks and compatibility ideographs are to be properly interleaved. The form of the charts for the CJK Unified Ideographs block is described in the introduc- tion to Chapter 14, Code Charts. A full radical-stroke index is also provided in Section 15.1, Han Radical-Stroke Index, to help users locate characters in the main charts.

Mapping to Standards The mappings defined by the IRG between the ideographs in the Unicode Standard and the IRG sources are included on the CD-ROM. These mappings are considered to be norma- tive parts of ISO/IEC 10646-1; that is, the characters are defined to be the targets for conver- sion of these characters in these character set standards. These mappings have as their source editions of the relevant national standards that were provided directly to the IRG by its member bodies. In some cases, these editions differ from the published editions generally available. The mappings defined by the IRG are considered normative for ISO/IEC 10646-1, and must be considered normative for the Unicode Standard as well. Because they may not match the mappings derived from published editions of the standards, developers of indi- vidual mapping tables, which are intended to handle real-life intercharacter set mappings,

266 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han may choose to use alternative mappings more directly correlated with published character set editions. The Unicode Consortium provides tables indicating where the variant map- pings may be desirable. Specialized conversion systems may also choose more sophisticated mapping mecha- nisms—for example, semantic conversion, variant normalization, or conversion between simplified and traditional Chinese. The Unicode Consortium also provides mapping information that extends beyond the normative mappings defined by the IRG. These additional mappings include mappings to character set standards included in the U source, including duplicate characters from KS C 5601-1987, mappings to portions of character set standards omitted from IRG sources, ref- erences to standard dictionaries, and suggested character/stroke counts.

CJK Compatibility Ideographs: U+F900–U+FAFF The Korean national standard KS C 5601-1987, which served as one of the primary source sets for the Unified CJK Ideograph Repertoire and Ordering, Version 2.0, contains 268 duplicate encodings of identical ideograph forms to denote alternative pronunciations. That is, in certain cases, the standard encoded a single character multiple times to denote different linguistic uses. This approach is like encoding the letter “a” five times to denote the different pronunciations it has in the words hat, able, art, father, and adrift. They are in all ways identical in shape to their nominal counterparts, and so were excluded by the IRG from its sources. For round-trip conversion with KS C 5601-1987, they are encoded sepa- rately from the primary CJK Unified Ideographs block. In addition, another 34 ideographs from various regional and industry standards were encoded in this block, primarily to achieve round-trip conversion compatibility. Twelve of these 34 ideographs (U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29) are not encoded in the CJK Unified Ideographs Areas. These 12 characters are not duplicates and should be treated as a small extension to the set of unified ideographs.

Kanbun: U+3190–U+319F This block contains a set of marks used in Japanese texts to indicate the Japanese reading order of classical Chinese texts. They are not encoded in any current character encoding standards but are widely used in literature. They are typically written in an anno- tation style to the left of each line of vertically rendered Chinese text. See also enclosed CJK letters and months (U+3200..U+32FF) and CJK compatibility (U+3300..U+33FF).

CJK and KangXi Radicals: U+2E80–U+2FD5 East Asian ideographic radicals are ideographs or fragments of ideographs used to index dictionaries and word lists, and as the basis for creating new ideographs. The term radical comes from the Latin radix, meaning root, and refers to the part of the character under which the character is classified in dictionaries. Section 15.1, Han Radical-Stroke Index, pro- vides information on how to use radical-stroke lookup to locate ideographs encoded in the Unicode Standard.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 267 10.1 Han East Asian Scripts

There is no single radical set in general use throughout East Asia; however, the set of 214 radicals used in the eighteenth-century KangXi dictionary is universally recognized. The visual appearance of radicals is often very different when they are used as radicals from what it is when they are stand-alone ideographs. Indeed, many radicals have multiple graphic forms when used as parts of characters. A standard example is the water radical, which is written $ when an ideograph and generally % when part of an ideograph. The Unicode Standard includes two blocks of encoded radicals: The KangXi Radicals block (U+2F00 through U+2FD5), which contains the base forms for the 214 radicals, and the CJK Radicals Extension block (U+2E80 through U+2EF3), which contains a set of variant shapes taken by the radicals either when they occur as parts of characters or are used for simplified Chinese. These variant shapes are commonly found as independent and distinct characters in dictionary indices—such as for the radical-stroke charts in the Unicode Stan- dard. As such, they have not been subject to the usual unification rules used for other char- acters in the standard. Most of the radicals in the CJK and KangXi Radicals blocks are also part of the CJK Unified Ideographs block of the Unicode Standard. Radicals that have one graphic form as an ideo- graph and another as part of an ideograph are generally encoded in both forms in the CJK Unified Ideographs block (such as U+6C34 and U+6C35 for the water radical). Standards. CNS 11643-1992 includes a block of radicals separate from its ideograph block. This block includes of 212 of the 214 KangXi radicals. These characters are included in the KangXi Radicals block. Those radicals that are ideographs in their own right have a definite meaning and are usu- ally referred to by that meaning. Accordingly, most of the characters in the KangXi Radicals block have been assigned names reflecting their meaning. The other radicals have been given names based on their shape. Semantics. Characters in the CJK and KangXi Radicals blocks should never be used as ideographs. They have different properties and meaning. U+2F00    is not equivalent to U+4E00    4E00. The former is to be treated as a symbol, the latter as a word or part of a word. It is necessary to make a semantic distinction between a character used as an ideograph and the same character used as a radical. To emphasize this difference, radicals may be given a distinct font style from their ideographic counterparts.

Ideographic Description: U+2FF0–U+2FFB Although the Unicode Standard includes over 27,000 ideographs, many thousands of extremely rare ideographs were nevertheless left unencoded. As an example, nearly half of the characters in the KangXi dictionary are unencoded. Research into cataloging additional ideographs for encoding continues, but it is anticipated that at no point will the entire set of potential, encodable ideographs be completely exhausted. In particular, ideographs con- tinue to be coined and such new coinages will invariably be unencoded. The 12 characters in the Ideographic Description block provide a mechanism for the stan- dard interchange of text that must reference unencoded ideographs. Unencoded ideo- graphs can be described using these characters and encoded ideographs; the reader can then create a mental picture of the ideographs from the description. This process is different from a formal encoding of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideo- graphs; there is no equivalence defined for described ideographs. Conceptually, ideograph

268 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han descriptions are more akin to the English phrase, “an ‘e’ with an acute accent on it,” than to the character sequence “U+006E U+0301.” In particular, support for the characters in the Ideographic Description block does not require the rendering engine to recreate the graphic appearance of the described character. Note also that many of the ideographs that users might represent using the Ideographic Description characters will be formally encoded in future versions of the Unicode Stan- dard. The Ideographic Description algorithm depends on the fact that virtually all CJK ideo- graphs can be broken down into smaller pieces which are themselves ideographs. The broad coverage of the ideographs already encoded in the Unicode Standard implies that the vast majority of unencoded ideographs can be represented using the Ideographic Descrip- tion characters. Ideographic Description Sequences. Ideographic Description Sequences are defined by the following grammar. See Section 0.2, Notational Conventions, for the notational conventions used here. IDS ::= UnifiedIdeograph | Radical | BinaryDescriptionOperator IDS IDS | TrinaryDescriptionOperator IDS IDS IDS BinaryDescriptionOperator ::= U+2FF0 | U+2FF1 | U+2FF4 | U+2FF5 | U+2FF6 | U+2FF7 | U+2FF8 | U+2FF9 | U+2FFA | U+2FFB TrinaryDescriptionOperator::= U+2FF2 | U+2FF3 Radical ::= U+2E80 | U+2E81 | ... | U+2EF2 | U+2EF3 | U+2F00 | U+2F01 | ... | U+2FD4 | U+2FD5 UnifiedIdeograph ::= U+3400 | U+3401 | ... | U+4DB4 | U+4DB5 | U+4E00 | U+4E01 | ... | U+9FA4 | U+9FA5 | U+FA0E |U+FA0F | U+FA11 | U+FA13 | U+FA14 | U+FA1F |U+FA21 | U+FA23 | U+FA24 | U+FA27 | U+FA28 |U+FA29 In addition to the above grammar, Ideographic Description Sequences have two additional length constraints: • No sequence can be longer than 16 Unicode scalar values in length. • No sequence can contain more than six UnifiedIdeographs in a row without an intervening Ideographic Description character. A sequence of characters that includes Ideographic Description characters but does not conform to the above grammar and length constraints is not an Ideographic Description Sequence. The operators indicate the relative graphic positions of the operands running from left to right and from top to bottom. Note that non-unique compatibility ideographs (U+F900 through U+FA2D, but not U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, or U+FA29) are not counted as unified ideographs for the purposes of this grammar, although they do have the ideographic property (see Section 4.10, Letters and Other Useful Properties). Non-unique compatibility ideographs are excluded from Ideo- graphic Description Sequences to incrementally reduce the ambiguity of such sequences. Figure 10-8 illustrates the use of this grammar to provide descriptions of unencoded ideo- graphs. A user wishing to represent an unencoded ideograph will need to analyze its structure to determine how to describe it using an Ideographic Description Sequence. As a rule, it is best to use the natural radical-phonetic division for an ideograph if it has one and to use as

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 269 10.1 Han East Asian Scripts

Figure 10-8. Using the Ideographic Description Characters

➡ ‚ ➡  ‚ U+2FF1U+4E95 U+86D9 U+2FF0U+6C34 U+2FF1 U+53E3 U+5DDB ➡ ‚  ➡  ‚ U+2FF1 U+4E95 U+2FF0U+866B U+572D U+2FF0U+6C35 U+2FF1 U+53E3 U+5DDB ➡ ‚  ‚ ➡  ‚ U+2FF1 U+4E95 U+2FF0 U+866B U+2FF1 U+571F U+571F U+2FF0U+2F54 U+2FF1 U+53E3 U+5DDB ➡ ‰ ‚ ➡  ‚ U+2FF8U+5382 U+2FF1 U+4ECA U+6B62 U+2FF0U+2EA1 U+2FF1 U+53E3 U+5DDB short a description sequence as possible, but there is no requirement that these rules be fol- lowed. Beyond that, the shortest possible Ideographic Description Sequence is preferred. The length constraints allow random access into a string of ideographs to have well-defined limits. Only a small number of characters need to be scanned backward to determine whether those characters are part of an Ideographic Description Sequence. The fact that Ideographic Description Sequences can contain other Ideographic Descrip- tion Sequences means that implementations may need to be aware of the recursion depth of a sequence and its back-scan length. The recursion depth of an Ideographic Description Sequence is the maximum number of pending operations encountered in the process of parsing an Ideographic Description Sequence. In Figure 10-8, the maximum recursion depth is shown in the third example, where three operations are still pending at the end of the Ideographic Description Sequence. The back-scan length is the maximum number of ideographs unbroken by Ideographic Description characters in the sequence. None of the examples in Figure 10-8 has more than two ideographs in a row; for all, the back-scan depth is two. The Unicode Standard places no formal limits on the recursion depth of Ideographic Description Sequences. It does, however, limit the back-scan depth for valid Ideographic Description Sequences to be six or less. Equivalence. Many unencoded ideographs can be described in more than one way using this algorithm, either because the pieces of a description can themselves be broken down further (examples one through three in Figure 10-8), or because duplications appear within the Unicode Standard (examples five and six in Figure 10-8). The Unicode Standard does not define equivalence for two Ideographic Description Sequences that are not identical. Figure 10-8 contains numerous examples illustrating how different Ideographic Description Sequences might be used to describe the same ideo- graph. In particular, Ideographic Description Sequences are not to be used to provide alternative graphic representations of encoded ideographs. Searching, collation, and other content- based text operations would then fail. Interaction with the Ideographic Variation Mark. As with ideographs proper, the Ideo- graphic Variation Mark (U+303E) may be placed before an Ideographic Description Sequence to indicate that the description is only an approximation of the original ideo- graph desired. A sequence of characters that includes an Ideographic Variation Mark is not an Ideographic Description Sequence. Rendering. Ideographic Description characters are visible characters. They are not to be treated as control characters. The sequence U+2FF1 U+4E95 U+86D9 must have a distinct appearance from U+4E95 U+86D9.

270 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han

An implementation may render a valid Ideographic Description Sequence either by render- ing the individual characters separately, or by parsing the Ideographic Description Sequence and drawing the ideograph so described. In the latter case, the Ideographic Description Sequence should be treated as a ligature of the individual characters for pur- poses of hit testing, cursor movement, and other user interface operations. (See Section 5.12, Editing and Selection.) Character Boundaries. Ideographic Description characters are not combining characters, and there is no requirement that they affect character or word boundaries. Thus U+2FF1 U+4E95 U+86D9 may be treated as a sequence of three characters or even three words. Implementations of the Unicode Standard may choose to parse Ideographic Description Sequences when calculating word and character boundaries, but such a decision will make the algorithms involved significantly more complicated and slower. Standards. The Ideographic Description characters are found in GBK—an extension to GB 2312-80 that adds all Unicode ideographs not already in GB 2312-80. GBK is defined as a normative annex of GB 13000.1-93.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 271 10.2 Hiragana East Asian Scripts

10.2 Hiragana

Hiragana: U+3040–U+309F Hiragana is the cursive used to write Japanese words phonetically and to write sentence particles and inflectional endings. It is also commonly used to indicate the pro- nunciation of Japanese words. Hiragana syllables are phonetically equivalent to corre- sponding Katakana syllables. Standards. The Hiragana block is based on the JIS X 0208-1990 standard, extended by the nonstandard syllable U+3094 , which is included in some Japanese corporate standards. Combining Marks. Hiragana and the related script Katakana use U+3099  -    and U+309A  -  -   to generate voiced and semi-voiced syllables from the base syllables, respectively. All common precomposed combinations of base syllable forms using these marks are already encoded as characters, and use of these precomposed forms is the predominant JIS usage. These combining marks must follow the base character to which they apply. As most implementations and JIS standard treat these marks as spac- ing characters, the Unicode Standard also contains two corresponding noncombining (spacing) marks at U+309B and U+309C. Iteration Marks. The two characters U+309D    and U+309E     are punctuation-like characters that denote the itera- tion (repetition) of a previous syllable according to whether the repeated syllable has an unvoiced or voiced consonant, respectively.

272 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.3 Katakana

10.3 Katakana

Katakana: U+30A0–U+30FF Katakana is the noncursive syllabary used to write non-Japanese (usually Western) words phonetically in Japanese. It is also used to write Japanese words with visual emphasis. Kata- kana syllables are phonetically equivalent to corresponding Hiragana syllables. Katakana contains two characters, U+30F5     and U+30F6    , that have no direct correspondent in Hiragana; they are used in special Japanese spelling conventions (for example, the spelling of placenames that include archaic Japanese connective particles). Standards. The Katakana block is based on the JIS X 0208-1990 standard. Punctuation-like Characters. U+30FB    is used to separate words when writing non-Japanese phrases. U+30FC -    is used predominantly with Katakana and occasionally with Hiragana to denote a lengthened vowel of the previously written syllable. The two iteration marks, U+30FD    and U+30FE    , serve the same function in Katakana writing that the two Hiragana iteration marks serve in Hiragana writing.

Halfwidth and Fullwidth Forms: U+FF00–U+FFEF In the context of East Asian coding systems, a double-byte character set (DBCS) such as JIS X 0208-1990 or KS C 5601-1987 is generally used together with a single-byte character set (SBCS), such as ASCII or a variant of ASCII. Text that is encoded with both a DBCS and SBCS is typically displayed such that the glyphs representing DBCS characters occupy two display cells where a display cell is defined in terms of the glyphs used to display the SBCS (ASCII) characters. In these systems, the two-display-cell width is known as the fullwidth or zenkaku form, and the one-display-cell width is known as the halfwidth or hankaku form. Because of this mixture of display widths, certain characters often appear twice, once in fullwidth form in the DBCS repertoire and once in halfwidth form in the SBCS repertoire. To achieve round-trip conversion compatibility with such mixed encoding systems, it is necessary to encode both fullwidth and halfwidth forms of certain characters. This block consists of the additional forms needed to support conversion for existing texts that employ both forms. In the context of conversion to and from such mixed width encodings, all characters in the General Scripts Area should be construed as halfwidth (hankaku) characters if they have a fullwidth equivalent elsewhere in the standard or if they do not occur in the mixed width encoding; otherwise, they should be construed as fullwidth (zenkaku). Specifically, most characters in the CJK Phonetics and Symbols Area and the Unified CJK Ideograph Area, along with the characters in the CJK Compatibility Ideographs, CJK Compatibility Forms, and Small Form Variants blocks, should be construed as fullwidth (zenkaku) characters. For a complete description of the East Asian Width property, see Unicode Technical Report #11, “East Asian Width,” on the CD-ROM or the up-to-date version on the Unicode Web site. The characters in this block consist of fullwidth forms of the ASCII block (except ), certain characters of the Latin-1 Supplement, and some currency symbols. In addition, this

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 273 10.3 Katakana East Asian Scripts block contains halfwidth forms of the Katakana and Hangul Compatibility Jamo charac- ters. Finally, a number of characters from the Symbols Area are replicated here (U+FFE8..U+FFEE) with explicit halfwidth semantics. As with other compatibility characters, the preferred Unicode encoding is to use the nomi- nal counterparts of these characters and use rich text font or style bindings to select the appropriate glyph size and width. Unifications. The fullwidth form of U+0020  is unified with U+3000  .

274 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.4 Hangul

10.4 Hangul

Hangul Jamo: U+1100–U+11FF Korean Hangul may be considered to be a syllabic script. As opposed to many other syllabic scripts, the syllables are formed from a set of alphabetic components in a regular fashion. These alphabetic components are called jamo. The Unicode Standard contains both the complete set of precomposed modern Hangul syl- lable blocks and the set of conjoining Hangul jamo in this block. This set of conjoining Hangul jamo can be used to encode all modern and ancient syllable blocks. For a descrip- tion of conjoining jamo behavior and precomposed Hangul Syllables, see Section 3.11, Conjoining Jamo Behavior, and the Hangul Syllables character block description (U+AC00..U+D7A3). The Hangul jamo are divided into three classes: choseong (leading consonants, or syllable- initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). In the following discussion, these classes are abbreviated as L (leading consonant), V (vowel), and T (trailing consonant). For use in composition, two invisible filler characters act as placeholders for choseong or jungseong: U+115F    and U+1160   . Collation. The unit of collation in Korean text is normally the Hangul syllable block. Because of the arrangement of the conjoining jamo, their sequences may be collated with a binary comparison. For example, in comparing (a) LVTLV with (b) LVLV, the first sylla- ble block (LVT ) should be compared with the second (LV). Supposing the first two char- acters are identical—because all trailing consonants have binary values greater than all leading consonants—the T would compare as greater than the second L in (b). This result produces the correct ordering between the strings. The positions of the fillers in the code charts were also chosen with this condition in mind. • As with any coded characters, collation cannot depend simply on a binary com- parison. Odd sequences such as superfluous fillers will produce an incorrect sort, as will cases where a non-jamo character follows a sequence (such as com- paring LVT with LVx, where x is a Unicode character above U+11FF, such as U+3000  ). If mixtures of precomposed syllable blocks and jamo are collated, the easiest approach is to decompose the precomposed syllable blocks into conjoining jamo before comparing.

Hangul Compatibility Jamo: U+3130–U+318F This block consists of spacing, nonconjoining Hangul consonant and vowel (jamo) ele- ments. These characters are provided solely for compatibility with the KS C 5601 standard. Unlike the characters found in the Hangul Jamo block (U+1100..U+11FF), the jamo char- acters in this block have no conjoining semantics. The characters of this block are considered to be fullwidth forms in contrast with the half- width Hangul Compatibility Jamo found at U+FFA0..U+FFDF. Standards. The Unicode Standard follows KS C 5601 for Hangul Jamo elements.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 275 10.4 Hangul East Asian Scripts

Hangul Syllables: U+AC00–U+D7A3 The Hangul script used in the Korean writing system consists of individual consonant and vowel letters (jamo) that are visually combined into square display cells to form entire syl- lable blocks. Hangul syllables may be encoded directly as precomposed combinations of individual jamo or as decomposed sequences of conjoining jamo. The latter encoding is supported by the Hangul Jamo block (U+1100..U+11FF). The syllabic encoding method is described here. Modern Hangul syllable blocks can be expressed with either two or three jamo, either in the form consonant + vowel or in the form consonant + vowel + consonant. There are 19 possi- ble leading (initial) consonants (choseong), 21 vowels (jungeong), and 27 trailing (final) consonants (jongseong). Thus there are 399 possible two-jamo syllable blocks and 10,773 possible three-jamo syllable blocks, for a total of 11,172 modern Hangul syllable blocks. This collection of 11,172 modern Hangul syllables encoded in this block is known as the Johab set. Standards. The Hangul syllables are taken from KS C 5601-1992, representing the full Johab set. This group represents a superset of the Hangul syllables encoded in earlier ver- sions of Korean standards (KS C 5601-1987, KS C 5657-1991). Equivalence. Each of the Hangul syllables encoded in this block may be encoded by an equivalent sequence of conjoining jamo; however, the converse is not true because thou- sands of archaic Hangul syllables may be encoded only as a sequence of conjoining jamo. Implementations that use a conjoining jamo encoding are able to represent these archaic Hangul syllables. Hangul Syllable Composition. The Hangul syllables can be derived from conjoining jamo by a regular process of composition. The algorithm that maps a sequence of conjoining jamo to the encoding point for a Hangul syllable in the Johab set is detailed in Section 3.11, Conjoining Jamo Behavior. Hangul Syllable Decomposition. Conversely, any Hangul syllable from the Johab set can be decomposed into a sequence of conjoining jamo characters. The algorithm that details the formula for decomposition is provided in Section 3.11, Conjoining Jamo Behavior. Hangul Syllable Name. The character names for Hangul syllables are derived algorithmi- cally from the decomposition. (For full details, see Section 3.11, Conjoining Jamo Behavior.) Hangul Syllable Representative Glyph. The representative glyph for a Hangul syllable can be formed from its decomposition based on the categorization of vowels shown in Table 10-6. Table 10-6. Line-Based Placement of Jungseong

Vertical Horizontal Horizontal and Vertical 1161  1169  116A  1162  116D  116B  1163  116E  116C  1164  1172  116F  1165  1173  1170  1166  1171  1167  1174  1168  1175 

276 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.4 Hangul

If the vowel of the syllable is based on a vertical line, place the leading consonant to its left. If the vowel is based on a horizontal line, place the preceding consonant above it. If the vowel is based on a combination of vertical and horizontal lines, place the preceding conso- nant above the horizontal line and to the left of the vertical line. In either case, place a fol- lowing consonant, if any, below the middle of the resulting group. In any particular font, the exact placement, shape, and size of the components will vary according to the shapes of the other characters and the overall design of the font. See also enclosed CJK letters and months (U+3200..U+32FF), CJK compatibility (U+3300..U+33FF), and halfwidth and fullwidth forms (U+FF00..U+FFEF).

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 277 10.5 Bopomofo East Asian Scripts

10.5 Bopomofo

Bopomofo: U+3100–U+312F Bopomofo constitute a set of characters used to annotate or teach the phonetics of Chinese, primarily the standard Mandarin language. The characters are used in dictionaries and teaching materials, but not in the actual writing of Chinese text. The formal Chinese names for this alphabet are Zhuyin-Zimu (“phonetic alphabet”) and Zhuyin-Fuhao (“phonetic symbols”), but the informal term “Bopomofo” (analogous to “ABCs”) provides a more ser- viceable English name and is also used in China. The Bopomofo were developed as part of a populist literacy campaign following the 1911 revolution; thus they are acceptable to all branches of modern Chinese culture, although in the People’s Republic of China their function has been largely taken over by the Pinyin romanization system. Standards. The standard Mandarin set of Bopomofo is included in the People’s Republic of China standard GB 2312 and in the Republic of China (Taiwan) standard CNS 11643. Mandarin Tone Marks. Small modifier letters used to indicate the five Mandarin tones are part of the Bopomofo system. In the Unicode Standard they have been unified into the Modifier Letter range, as shown in Table 10-7. Table 10-7. Mandarin Tone Marks first tone U+02C9 MODIFIER LETTER MACRON second tone U+02CA MODIFIER LETTER ACUTE ACCENT third tone U+02C7 CARON fourth tone U+02CB MODIFIER LETTER GRAVE ACCENT light tone U+02D9 DOT ABOVE

Standard Mandarin Bopomofo. The order of the Mandarin Bopomofo letters U+3105.. U+3129 is standard worldwide. The code offset of the first letter U+3105  -   from a multiple of 16 is included to match the offset in the ISO-registered standard GB 2312. The character U+3127    is usually written as a vertical stroke when Bopomofo text is set vertically. In the Unicode Standard, this representation is con- sidered to be a rendering variation; the variant is not assigned a separate character code. Extended Bopomofo. To represent the sounds of Chinese dialects other than Mandarin, the basic Bopomofo set U+3105..U+3129 has been augmented by additional phonetic charac- ters. These extensions are much less broadly recognized than the basic Mandarin set. The three extended Bopomofo characters U+312A..U+312C are cited in some standard refer- ence works, such as the encyclopedia Xin Ci Hai. Another set of 24 extended Bopomofo, encoded at U+31A0..U+31B7, was designed in 1948 to cover additional sounds of the Min- nan and Hakka dialects. The extensions are used together with the main set of Bopomofo characters to provide a complete phonetic orthography for those dialects. There are no standard Bopomofo letters for the phonetics of Cantonese or several other Southern Chi- nese dialects. The small characters encoded at U+31B4..U+31B7 represent syllable-final consonants not present in standard Mandarin or Mandarin dialects. They have the same shapes as Bopo- mofo “p”, “t”, “k”, and “h”, respectively, but are rendered smaller than the initial conso- nants; they are also generally shown close to the syllable medial vowel character. These final letters are encoded separately so that Minnan and Hakka dialects can be represented unambiguously in plain text without having to resort to subscripting or other fancy text mechanisms to represent the final consonants.

278 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.5 Bopomofo

Extended Bopomofo Tone Marks. In addition to the Mandarin tone marks enumerated in Table 10-7, additional tone marks appropriate for use with the extended Bopomofo tran- scriptions of Minnan and Hakka can be found in the Modifier Letter range, as shown in Table 10-8. The “departing tone” refers to the qusheng in traditional Chinese tonal analysis, with the yin variant historically derived from voiceless initials and the yang variant from voiced initials. Southern Chinese dialects in general maintain more tonal distinctions than Mandarin. Table 10-8. Minnan and Hakka Tone Marks yin departing tone U+02EA YIN DEPARTING TONE MARK yang departing tone U+02EB YANG DEPARTING TONE MARK

Rendering of Bopomofo. Bopomofo is rendered left to right in horizontal text, but also commonly appears in vertical text. It may be used by itself in either orientation, but most commonly appears in interlinear annotation of Chinese (Han character) text. It is not uncommon for children’s books to be completely annotated with Bopomofo pronuncia- tions for every character. This interlinear annotation is structurally quite similar to the sys- tem of Japanese ruby annotation, but it has additional complications that result from the explicit usage of tone marks with the Bopomofo letters. In horizontal interlineation, the Bopomofo is generally placed above the corresponding Han character(s); tone marks, if present, appear at the end of each syllabic group of Bopo- mofo letters. In vertical interlineation, the Bopomofo is generally placed on the right side of the corresponding Han character(s); tone marks, if present, appear in a separate interlin- ear row to the right side of the vowel letter. When using extended Bopomofo for Minnan and Hakka, the tone marks may also be mixed with Latin digits 0–9 to express changes in actual tonetic values resulting from juxtaposition of basic tones.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 279 10.6 Yi East Asian Scripts

10.6 Yi

Yi: U+A000–U+A4CF The Yi syllabary is used to write the Yi language, a member of the Sino-Tibetan language family. The script is also known as Cuan or Wei. The Yi, also known as Lolo and Nuo-su, are one of the largest non-Han minorities in the People’s Republic of China (PRC). Most live in southwestern China, but others live in Myanmar, , and Vietnam. Yi is one of the official languages of PRC. The earliest surviving samples of classical Yi, an ideographic script, date from about 500 years ago. Unlike other Sinoform scripts, the ideographs themselves appear not to be derived from Han ideographs. There are some 8,000 to 10,000 characters in the classical , although the exact ideographs used varied from region to region. To improve literacy in Yi, the Yi syllabary was introduced in the 1970s. This syllabary is encoded in the Unicode Standard; the classical ideographic Yi script is not encoded at this time. Each Yi syllable consists of a consonantal initial, a final, and a tone. The core Yi syllabary consists of 819 signs for syllables with the first three tones (high, low, and middle low), plus a mark added to the form for the middle low tone to indicate a fourth tone (middle high). Standards. In 1991, a national standard for Yi was adopted by the PRC as GB 13134-91. This encoding includes all 1,165 Yi syllables and is the basis for the encoding used by the Unicode Standard, which includes all 1,165 Yi syllables. Naming Conventions and Order. The Yi syllables are named on the basis of their roman- ized sound values. The tone is indicated by appending a letter to the romanization: “t” for the high tone, “p” for the low tone, “x” for the middle high tone, and no letter for the mid- dle low tone. Rendering. Yi follows the writing rules for Han ideographs. Characters are generally writ- ten left to right or occasionally top to bottom. There is no typographic interaction between individual characters of the Yi script. Yi Radicals. To facilitate the lookup of Yi characters in dictionaries, a set of radicals has been invented. The Yi repertoire is divided into several subsets, each of which shares a com- mon stroke (radical). The name used for the radical is that of the corresponding Yi charac- ter closest to it in shape, with a “b” added as a suffix.

280 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 11 Additional Scripts 11

This chapter contains a collection of additional scripts that do not fit well into the script categories featured in other chapters: •Ethiopic •Cherokee • Canadian Aboriginal Syllabics •Mongolian Although the Mongolian and Ethiopic scripts can be traced to roots in the original Semitic writing systems, they would not be classified as Middle Eastern scripts today. The Cherokee and Canadian Aboriginal Syllabic scripts have roots in Latin letterforms and shorthand, but their application and evolution is a native North American development.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 283 11.1 Ethiopic Additional Scripts

11.1 Ethiopic

Ethiopic: U+1200–U+137F The Ethiopic syllabary originally evolved for writing the Semitic language Ge’ez, and indeed the English noun “Ethiopic” simply means “the Ge’ez language.” Ge’ez itself is now limited to liturgical usage, but its script has been adopted for modern use in writing several languages of central east Africa, including , Tigre, and Oromo. Basic and Extended Ethiopic. The Ethiopic characters encoded here are the basic set that has become established in common usage for writing major languages. As with other pro- ductive scripts, the basic Ethiopic forms are sometimes modified to produce an extended range of characters for writing additional languages. Research is ongoing to identify these extended Ethiopic forms, though many are rare and have scant typographic tradition. Encoding Principles. The syllables of the Ethiopic script are traditionally presented as a two-dimensional matrix of consonant-vowel combinations. The encoding follows this structure; in particular, the codespace range U+1200..U+1357 is interpreted as a matrix of 43 consonants crossed with 8 vowels, making 344 conceptual syllables. Most of these con- sonant-vowel syllables are represented by characters in the script, but some of them happen to be unused, accounting for the blank cells in the matrix. Variant Glyph Forms. A given Ethiopic syllable may be represented by different glyph forms, analogous to the glyph variants of Latin lowercase “a” or “g”, which do not coexist in the same font. Thus the particular glyph shown in the code chart for each position in the matrix is merely one representation of that conceptual syllable, and the glyph itself is not the object that is encoded. Labialized Subseries. A few Ethiopic consonants have labialized (“W”) forms that are tra- ditionally allotted their own consonant series in the syllable matrix, although only a subset of the possible voweled forms are realized. Each of these derivative series is encoded imme- diately after the corresponding main consonant series. Because the standard vowel series includes both “AA” and “WAA”, two different cells might represent the “consonant + W + AA” syllable. For example: U+1247 = Q + WAA: unused version of  U+124B = QW + AA:    In these cases, where the two conceptual syllables are equivalent, the entry in the labialized subseries is encoded and not the “consonant + WAA” entry in the main syllable series. The six specific cases are enumerated in Table 11-1. Table 11-1. Labialized Forms in -WAA -WAA Form Encoded as Not Used QWAA U+124B 1247 QHWAA U+125B 1257 XWAA U+128B 1287 KWAA U+12B3 12AF KXWAA U+12C3 12BF GWAA U+1313 130F

Also, within the labialized subseries, the sixth vowel (“-E”) forms are sometimes considered to be second vowel (“-U”) forms. For example:

284 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Additional Scripts 11.1 Ethiopic

U+1249 = QW + U: unused version of  U+124D = QW + E:    In these cases, where the two syllables are nearly equivalent, the “-E” entry is encoded and not the “-U” entry. The six specific cases are enumerated in Table 11-2. Table 11-2. Labialized Forms in -WE “-WE” Form Encoded as Not Used QWE U+124D 1249 QHWE U+125D 1259 XWE U+128D 1289 KWE U+12B5 12B1 KXWE U+12C5 12C1 GWE U+1315 1311

Keyboard Input. Because the Ethiopic script includes more than 300 characters, the units of keyboard input must constitute some smaller set of entities, typically 43+8 codes inter- preted as the coordinates of the syllable matrix. Because these keyboard input codes are expected to be transient entities that are resolved into syllabic characters before they enter stored text, keyboard input codes are not specified in this standard. Syllable Names. The Ethiopic script often has multiple syllables corresponding to the same Latin letter, making it difficult to assign unique Latin names. Therefore the names list makes use of certain devices (such as doubling a Latin letter in the name) merely to create uniqueness; this device has no relation to the phonetics of these syllables in any particular language. Encoding Order and Sorting. The order of the consonants in the encoding is based on the traditional alphabetical order. It may differ from the sort order used for one or another lan- guage, if only because in many languages various pairs or triplets of syllables are treated as equivalent in the first sorting pass. For example, an Amharic dictionary may start out with a section headed by three H-like syllables: U+1200    U+1210    U+1280    Thus the encoding order cannot and does not implement a collation procedure for any particular language using this script. Word Separators. The traditional word separator is U+1361   ( : ). In modern usage, a plain white wordspace (U+0020 ) is becoming common. Diacritical Marks. The Ethiopic script generally makes no use of diacritical marks, but they are sometimes employed for scholarly or didactic purposes. In particular, U+0308   and U+030E      are some- times used to indicate emphasis or gemination (consonant doubling). Numbers. The Ethiopic number system does not use a zero, nor is it based on digital-posi- tional notation. A number is denoted as a sequence of powers of 100, each preceded by a coefficient (2 through 99). In each term of the series, the power 100^n is indicated by n HUNDRED characters (merged to a digraph when n = 2). The coefficient is indicated by a tens digit and a ones digit, either of which is absent if its value is zero.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 285 11.1 Ethiopic Additional Scripts

For example, the number 2345 is represented by 2345 = (20 + 3)*100^1 + (40 + 5)*100^0 = 20 3 100 40 5 = TWENTY THREE HUNDRED FORTY FIVE = 1373 136B 137B 1375 136D A language using the Ethiopic script may have a word for “thousand,” such as Amharic “” (U+123A), and a quantity such as 2,345 may also be written as it is spoken in that language, which in the case of Amharic happens to parallel English: 2,345 = TWO thousand THREE HUNDRED FORTY FIVE = 136A 123A 136B 137B 1375 136D

286 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Additional Scripts 11.2 Cherokee

11.2 Cherokee

Cherokee: U+13A0–U+13FF The Cherokee script is used to write the . Cherokee is a member of the Iroquioan language family. It is related to Cayuga, Seneca, Onondaga, Wyandot-Huron, Tuscarora, Oneida, and Mohawk. The relationship is not close because roughly 3,000 years ago the Cherokees migrated southeastward from the Great Lakes region of North America to what is now North Carolina, Tennessee, and . Cherokee is the native tongue of approximately 20,000 people, although most speakers today use it as a second language. The Cherokee word for both the language and the people is “Tsalagi.” The , as invented by Sequoyah between 1815 and 1821, contained 6 vow- els and 17 consonants. Sequoyah avoided copying from other alphabets, but his original letters were modified to make them easier to print. The first font for Cherokee was designed by Dr. Samuel A. Worcester. Using fonts available to him, he assigned a number of Latin letters to the Cherokee syllables. At this time the Cherokee letter “HV” was dropped, and the Cherokee syllabary reached its current size of 85 letters. Dr. Worcester’s press printed 13,980,000 pages of Native American-language text, most of it in Cherokee. Tones. Each Cherokee syllable can be spoken on one of four pitch or tone levels, or can slide from one pitch to one or two others within the same syllable. However, only in certain words does the tone of a syllable change the meaning. Tones are unmarked. Case and Spelling. The Cherokee script is caseless, although for purposes of emphasis occasionally one letter will be made larger than the others. Cherokee spelling is not stan- dardized: each person spells as the word sounds to him or her. Numbers. Although Sequoyah invented a Cherokee number system, it was not adopted and is not encoded here. The Cherokee Nation uses European numbers. Cherokee speakers pay careful attention to the use of ordinal and cardinal numbers. When speaking of a num- bered series, they will use ordinals. For example, when numbering chapters in a book, Cherokee headings would use First Chapter, Second Chapter, and so on, instead of Chapter One, Chapter Two,... . Rendering and Input. Cherokee is a left-to-right script, which requires no combining characters. Several keyboarding conventions exist for inputting Cherokee. Some involve dead-key input based on Latin transliterations; some are based on sound-mnemonics related to Latin letters on keyboards; and some are ergonomic systems based on frequency of the syllables in the Cherokee language. Punctuation. Cherokee uses standard Latin punctuation. Standards. There are no other encoding standards for Cherokee.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 287 11.3 Canadian Aboriginal Syllabics Additional Scripts

11.3 Canadian Aboriginal Syllabics

Canadian Aboriginal Syllabics: U+1400–U+167F The characters in this block are a unification of various local syllabaries of Canada into a single repertoire based on character appearance. The syllabics were invented in the late 1830s by James Evans for Algonquian languages. As other communities and linguistic groups adopted the script, the main structural principles described below were adopted. The primary user community for this script consists of several aboriginal groups through- out Canada, including Algonquian, , and Athapascan language families. The script is also used by governmental agencies and in business, education, and media. Organization. The repertoire is organized primarily on structural principles found in the CASEC [1994] report, and is essentially a glyphic encoding. The canonical structure of each character series consists of a consonant shape with five variants. Typically the shape points down when the consonant is combined with the vowel /e/, up when combined with the vowel /i/, right when combined with the vowel /o/, and left when combined with the vowel /a/. It is reduced and superscripted when in syllable-final position, not followed by a vowel. For example:

”–˜® PE PI PO PA P

Some variations in vowels also occur. For example, in Inuktitut usage, the syllable U+1450 µ    is transcribed into Latin letters as “TU” rather than “TO”, but the structure of the syllabary is generally the same regardless of language. Arrangement. The arrangement of signs follows the Algonquian ordering (down-pointing, up-pointing, right-pointing, left-pointing), as in the previous example. Sorted within each series are the variant forms of each character series. Algonquian vari- ants appear first, then Inuktitut variants, then Athapascan variants. This arrangement is convenient and consistent with the historical diffusion of Syllabics writing; it does not imply any hierarchy. Some glyphs do not show the same down/up/right/left directions in the typical fashion— for example, beginning with U+146B Ð   . These glyphs are varia- tions of the rule because of the shape of the basic glyph; they do not affect the convention. and modify the character series through the addition of various marks (for example, U+143E £   ). Such modified characters are considered unique syllables. They are not decomposed into base characters and one or more diacritics. Some language families have different conventions for placement of the modifying mark. For the sake of consistency and simplicity, and to support multiple North American languages in the same document, each of these variants is assigned a unique code value.

288 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Additional Scripts 11.4 Mongolian

11.4 Mongolian

Mongolian: U+1800–U+18AF The Mongolians are key representatives of a cultural-linguistic group known as Altaic, after the Altai mountains of . In the past, these peoples have dominated the vast expanses of Asia and beyond, from the Baltic to the Sea of Japan. Echoes of remain from Finland, , and Turkey, across central Asia, to Korea and Japan. Today the Mongolians themselves are represented politically in Mongolia proper (formerly the Mongolian People’s Republic, also known as Outer Mongolia) and Inner Mongolia (for- mally the Inner Mongolia Autonomous Region, China), with Mongolian populations also living in other areas of China. The traditional encoded here has been in continuous use since the times of Genghis Khan. In the Mongolian People’s Republic, it was replaced in general use by a Cyrillic orthography in the early 1940s, but the traditional script was restored by law in 1992. As a practical matter, both traditional and Cyrillic texts are still seen in Mongolia and in Western scholarship. There is no one-to-one transcription between the two scripts; approximate correspondence mappings are indicated in the Mongolian character names list, but are not necessarily unique in either direction. All of the Cyrillic characters needed to write Mongolian are included in the Cyrillic section of the Unicode Standard. The traditional Mongolian script is used to write a Mongolian literary language of classical origin, which is distinct from spoken Mongolian dialects. In addition, the Manchu, Sibe, and formed their own writing systems based on the Mongolian script. To convey Buddhist classics accurately, the Ali Gali letters were added to traditional Mongolian, Todo, and Manchu for the transcription of Tibetan and Sanskrit. Directionality. The Mongolian script was derived around the twelfth century from the Uighur script, which originated from Aramaic, a right-to-left Semitic script. At some point the writing was transformed as though the whole page had been rotated 90 degrees coun- terclockwise, with the result that Mongolian is traditionally written vertically top to bottom in columns advancing from left to right. This directional pattern is unique among scripts for living languages. In modern contexts, the Mongolian script frequently occurs together with scripts of differ- ent directionality. Ideally, each script should appear separately in its own orientation, but in cases where this separation cannot be achieved, one of the scripts can be adopted into the directionality of the other. If the Mongolian script is adopted into horizontal text, its lines are rotated another 90 degrees counterclockwise so that the letters join left to right, and the columns are tran- scribed to the equivalent lines (first column becomes first line, and so on). If such text is viewed sideways, the usual Mongolian column order appears reversed, but this orientation can be workable for short stretches of text. Note that there are no bidirectional effects in such a layout, because all text is horizontal left to right. Standards. The encoding of Mongolian presented here has been cooperatively developed over years of careful research by a group of experts from Mongolia, China, and the West. These authorities are to publish a document called “Users’ Convention for System Imple- mentation of the International Standard on Mongolian Encoding,” explicitly defining Mongolian character shaping behavior in full. The complex sequence and shaping rules detailed in that document are only summarized in the following description.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 289 11.4 Mongolian Additional Scripts

Encoding Principle. The relationship between language and script in Mongolian is in some ways reminiscent of that in English: the same letter may represent different sounds; differ- ent letters may represent the same sound; and sometimes spellings are purely historical with little relation to modern pronunciation. Additionally, the Mongolian script is in some ways reminiscent of Arabic: letters usually join cursively together, assuming contextual forms according to a variety of rules, sometimes having up to 10 different presentation forms for the same letter. Despite these variations, the basic encoding principle for Mongolian is simple—the same principle as that for English and Arabic: there is a basic underlying alphabet whose letters are well agreed upon by the user community, apart from any question of their sounds or forms. The elements that are encoded are just the letters of the conventional Mongolian, Todo, Sibe, and Manchu alphabets, along with their associated numerals and punctuation. Punctuation. Punctuation symbols specific to Mongolian include U+1800   and U+1805   , which are used to mark, respectively, the beginning and end of a unit of text, such as a section or paragraph. In Mongolian Todo text, U+1806     is used at the beginning of the second line to indicate resumption of the broken word. It functions like a U+00AD  , except that this version appears at the beginning of a line rather than at the end. Some punctua- tion symbols used in Mongolian are coded in the General Punctuation block, such as U+2048 “?!”    and U+2049 “!?”   , used for side-by-side display in vertical text. Mongolian employs the usual U+0020  to separate words, plus distinctive narrow gap spaces within words, as discussed below. Letterforms. Mongolian is a cursive script, so the visual form of each letter generally depends on its position within a word. There are forms for the beginning (initial form), the middle (medial form), and the end of a word (final form). Vowels have an additional iso- lated form, used when the vowel appears between whitespaces. The positional forms are not all unique: in some cases, a character can have the same form in different positions (for example, the initial and medial forms of U+182A    are the same); in other cases, two different characters can look the same (for example, the isolated form of U+1824    is the same as that of U+1823   ). Besides positional forms, many letters also have free variants. These variants are not deter- mined by the position of a character within a word, but are prescribed by rules of spelling and grammar. Mongolian also employs some ligatures, particularly those formed by a vowel following a “bowed” consonant, which is a consonant like U+182A  -   that lacks a trailing vertical stem. The representative glyph in the code charts is generally the isolated form for the vowels and the initial form for the consonants, with the most common variant being chosen in case of alternatives. Other forms are occasionally chosen to avoid having different letters with the same glyph in the code charts: for example, U+1824    is represented by its initial form, because its isolated form is the same as that of U+1823  -  . Shaping Format Characters. For cases in which the contextual sequence of basic letters is not sufficient for a rendering engine to uniquely determine the appropriate glyph for a par- ticular letter, additional format characters are provided so that the typist may specify the desired rendering. There are seven characters that may function as glyph shape format selectors when inserted in a basic encoded Mongolian text sequence: U+180B      U+180C     

290 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Additional Scripts 11.4 Mongolian

U+180D      U+180E    U+200C   - U+200D    U+202F  -  Except for U+202F  -  and U+180E   , these characters normally have no visual appearance. Their sole purpose is to guide the rendering process in selecting the appropriate glyphs to represent base Mongolian letters in a particular context. A fully detailed specification of the shape selection rules is under development. The following are summaries of the effects of the shaping format characters. U+200C   - and U+200D    play the same role in Mongolian as in other scripts (see Chapter 13, Special Areas and Format Characters). Basi- cally, they evoke the same contextual selection effects in neighboring letters as do non-join- ing or joining regular letters, but are themselves invisible. U+202F  - : Many Mongolian words are formed by the addition of one or more grammatical suffixes to a basic stem word. The stem together with its suffixes is considered to be a single word, but each suffix is separated from the stem or from the pre- ceding suffix by a small gap that breaks the visual connectivity of the word. This word- internal whitespace is coded by U+202F  - . Because it represents a grammatical juncture, this character also functions as a variant selector that affects the contextual forms of the letters preceding and following it. Because it uses regular U+0020  between words, Mongolian may also use regular U+00A0 -  for its typical functions, such as preventing a line break between a pair of words. U+00A0 -  does not serve the function of U+202F  - . U+180E    is a word-internal thin whitespace that may occur only before the word-final vowels U+1820    and U+1821   . It determines the specific form of the character preceding it, selects a special vari- ant shape of these vowels, and produces a small gap within the word. If it erroneously occurs before any character other than “A” or “E”, it behaves like a U+202F  -  . The     characters are used to distinguish different variants of the same letter appearing under the same conditions—that is, where more than one rendered shape is possible and the selection must be indicated by human intervention rather than derived by algorithm. A free variant selector is encoded after the base character it modifies. Free variant selectors are not productive and are therefore ignored when not immediately preceded by one of their listed base characters. Tables specifying the precise shape-selection behavior of these characters are in the midst of the standardization process; the details will be published as a Unicode Technical Report and posted on the Unicode Web site when they become available. Conformant usage of these characters requires adherence to the shape-selection tables.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 291 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 12 Symbols 12

The universe of symbols is rich and open-ended. The collection of encoded symbols in the Unicode Standard encompasses the following: •Currency Symbols • Letterlike Symbols •Number Forms • Mathematical Operators and Arrows •Technical Symbols •Geometrical Symbols • Miscellaneous Symbols and Dingbats • Enclosed and Square • There are other notational systems not covered by the Unicode Standard. Some symbols mark the transition between pictorial items and text elements; because they do not have a well-defined place in plain text, they are not encoded here. Combining marks may be used with symbols, particularly the set encoded at U+20D0.. U+20FF (see Section 7.9, Combining Marks). Letterlike and currency symbols, as well as number forms including superscripts and sub- scripts are typically subject to the same font and style changes as the surrounding text. Some, but not all, of the square and enclosed symbols occur in East Asian contexts and gen- erally follow the prevailing type styles. Other symbols have an appearance that is independent of type style, or a more limited or altogether different range of type style variation than the regular text surrounding them. Symbols such as mathematical operators can be used with any script, or independent of any script. In a bidirectional context (see Section 3.12, Bidirectional Behavior), symbol characters have no inherent directionality, but resolve according to the Unicode bidirectional algorithm. Where the image of a symbol is not bilaterally symmetric, the mirror image is used when the character is part of the right-to-left text stream (see Section 4.7, Mirrored—Normative). Dingbats and optical character recognition characters are different from all other charac- ters in the standard in that they are encoded based on their precise appearance. Braille patterns are a special case, because they can be used to write text. They are included as symbols, as the Unicode Standard encodes only their shapes; the association of letters to patterns is left to other standards. When a character stream is intended primarily to convey text information, it should be coded using one of the scripts. Only when it is intended to

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 295 Symbols convey a particular binding of text to Braille pattern sequence should it be coded using the Braille patterns. Many symbols encoded in the Unicode Standard are obsolescent, and intended to support legacy implementations, such as terminal emulation or other charcter mode user inter- faces. Examples include box drawing components and control pictures.

296 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.1 Currency Symbols

12.1 Currency Symbols

Currency Symbols: U+20A0–U+20CF This block contains currency symbols not encoded in other blocks. Where the Unicode Standard follows the layout of an existing standard, such as for the ASCII, Latin 1, and Thai blocks, the currency symbols are encoded in those blocks, rather than here. Unification. The Unicode Standard does not duplicate encodings where more than one currency is expressed with the same symbol. Many currency symbols are overstruck letters. There are therefore many minor variants, such as the U+0024   $, with one or two vertical bars, or other graphical variation. The Unicode Standard considers these vari- ants to be typographical and provides a single encoding. Claims that glyph variants of a certain are used consistently to indicate a particular currency could not be substantiated upon further research. See ISO/IEC DIS 10367, Annex B (informative) for an example of multiple renderings for U+00A3  . Fonts. Currency symbols are commonly designed to be at digit width. Like letters, they fol- low the style of the font. Table 12-1 lists common currency symbols encoded in other blocks. Table 12-1. Other Currency Symbols Dollar, milreis, escudo U+0024 DOLLAR SIGN U+00A2 CENT SIGN Pound U+00A3 General currency U+00A4 CURRENCY SIGN Yen or yuan U+00A5 YEN SIGN Dutch florin U+0192 LATIN SMALL LETTER F WITH HOOK Baht U+0E3F THAI CURRENCY SYMBOL BAHT Riel U+17DB KHMER CURRENCY SYMBOL RIEL

For additional forms of currency symbols, see fullwidth forms (U+FFE0..U+FFE6). . The new single currency for member countries of the European Monetary Union (EMU) is the euro. The euro character is encoded in the Unicode Standard as U+20AC  .

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 297 12.2 Letterlike Symbols Symbols

12.2 Letterlike Symbols

Letterlike Symbols: U+2100–U+214F Letterlike symbols are symbols derived in some way from ordinary letters of an alphabetic script. This block includes symbols based on Latin, Greek, and Hebrew letters. Many of these symbols are encoded for compatibility. In general, the usage of distinct codes for let- terlike symbols that are merely font variants or alternative representations of other charac- ter sequences is strongly discouraged. When using letters as symbols in equations and formulae, as well as in other contexts, use normal alphabetic forms in the appropriate styles. For example, to represent degrees “°C”, use a sequence of U+00B0   + U+0043    , rather than U+2103  . For search- ing, treat these two sequences as identical. Despite its name, U+2118    is neither script nor capital—it is uniquely the Weierstrass elliptic function derived from a calligraphic lowercase p. U+2116   is provided both for Cyrillic use, where it looks like , and for com- patibility with Asian standards, where it looks like ï. The French practice is to use “N” fol- lowed by the degree sign: N°. In the context of East Asian typography, letterlike symbols are rendered as “wide” charac- ters occupying a full cell. They remain upright in vertical layout, contrary to the rotated rendering of their regular letter equivalents. Where the letterlike symbols have alphabetic equivalents, they collate in alphabetic sequence; otherwise, they should be treated as neutral symbols. The letterlike symbols may have different directional properties than normal letters; for example, the four transfinite cardinal symbols (U+2135..U+2138) are used in ordinary mathematical text and do not share the strong right-to-left directionality of the Hebrew letters from which they are derived. Styles. The letterlike symbols include some of the few instances in which the Unicode Stan- dard encodes stylistic variants of letters as distinct characters. For example, there are instances of black letter, double-struck, and script styles for certain Latin letters used as mathematical symbols. The choice of these stylistic variants for encoding reflects their common use as distinct symbols. It is recognized that a particular style can be applied to any Latin letter with a resulting semantic distinction in mathematical or logical text; appli- cations that require such systematic stylistic semantics should achieve them by using styles directly, rather than by seeking to extend the character-by-character encoding of such vari- ants in the Unicode Standard. The black letter style is often referred to as or Gothic in various sources. Technically, Fraktur and Gothic typefaces are distinct designs from black letter, but no encoding dis- tinctions are implied in the various symbol sources. The Unicode Standard simply uses black letter forms as the archetypes. A similar consideration applies to the double-struck style. This style is not literally double- struck, but is instead an open outline design that gives the visual appearance of being struck twice with a horizontal shift. For encoding purposes, this style can be considered equivalent to letterlike symbols rendered in outlined or shadowed typefaces to carry con- ventional semantic distinctions. Standards. The Unicode Standard encodes letterlike symbols from many different national standards and corporate collections.

298 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.3 Number Forms

12.3 Number Forms

Number Forms: U+2150–U+218F Number form characters are encoded solely for compatibility with existing standards. The same considerations with respect to compatibility apply as noted in the discussion of letter- like symbols. Fractions. The vulgar fraction characters encoded in this block can be equivalently repre- sented using U+2044  . Roman Numerals. The Roman numerals can be composed of sequences of the appropriate Latin letters. Upper- and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded for compatibility with East Asian standards. U+2180       and U+216F    -  can be considered to be glyphic variants of the same Roman numeral, but are distin- guished because they are not generally interchangeable, and because U+2180 cannot be considered to be a compatibility equivalent to the Latin letter M. U+2181     and U+2182     are distinct characters used in Roman numerals; they do not have compatibility decompositions in the Unicode Stan- dard. U+2183      is a form used in combinations with C and/or I to form large numbers—some of which vary with single character number forms such as D, M, U+2181, or others. For other number forms, see the Hangzhou numerals (U+3021..U+3029, U+3038.. U+303A) and the fractions in the Latin-1 block (U+00BC..U+00BE).

Superscripts and Subscripts: U+2070–U+209F Superscripts and subscripts have been included in the Unicode Standard only to provide compatibility with existing character sets. In general, the Unicode character encoding does not attempt to describe the positioning of a character above or below the baseline in typo- graphical layout. The superscript digits one, two, and three are coded in the Latin-1 Sup- plement block to provide code point compatibility with ISO 8859-1. Standards. The characters in this block are from sets registered with ECMA under ISO 2374 for use with ISO 2022.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 299 12.4 Mathematical Operators Symbols

12.4 Mathematical Operators

Mathematical Operators: U+2200–U+22FF The Mathematical Operators block includes character encodings for operators, relations, geometric symbols, and a few other symbols with special usages confined largely to mathe- matical contexts. In addition to the characters in this block, mathematical operators are also found in the Basic Latin (ASCII) and Latin-1 Supplement blocks. A few of the symbols from the Miscel- laneous Technical block and characters from General Punctuation are also used in mathe- matical notation. Latin letters in special font styles that are used as mathematical operators, such as U+210B ü    , as well as the Hebrew letter alef used as the operator first transfinite cardinal encoded by U+2135 ℵ  , are encoded in the block for letterlike symbols. Standards. Many national standards’ mathematical operators are covered by the characters encoded in this block. These standards include such special collections as ANSI Y10.20, ISO 6862, ISO 8879, and portions of the collection of the American Mathematical Society, as well as the original repertoire of TEX. Encoding Principles. Mathematical operators often have more than one meaning. There- fore the encoding of this block is intentionally rather shape-based, with numerous instances in which several semantic values can be attributed to the same Unicode value. For example, U+2218 °   may be the equivalent of white small circle or composite function or apl jot. The Unicode Standard does not attempt to distinguish all possible semantic values that may be applied to mathematical operators or relation symbols. On the other hand, mathematical operators, and especially relation symbols, may appear in various standards, handbooks, and fonts with a large number of purely graphical variants. Where variants were recognizable as such from the sources, they were not encoded sepa- rately. Unifications. Mathematical operators such as implies ⇒ and ↔ have been unified with the corresponding arrows (U+21D2    and U+2194   , respectively) in the Arrows block. The operator U+2208   is occasionally rendered with a taller shape than shown in the code charts. Mathematical handbooks and standards consulted treat these characters as variants of the same glyph. U+220A    is a distinctively small version of the element of that originates in mathematical pi fonts. The operators U+226B  - and U+226A  - are some- times rendered in a nested shape. Because no semantic distinction applies, the Unicode Standard provides a single encoding for each operator. A large class of unifications applies to variants of relation symbols involving equality, simi- larity, and/or negation. Variants involving one- or two-barred equal signs, one- or two-tilde similarity signs, and vertical or slanted negation slashes and negation slashes of different lengths are not separately encoded. Thus, for example, U+2288       , is the archetype for at least six different glyph variants noted in various collec- tions. In two instances, essentially stylistic variants are separately encoded: U+2265 -     is distinguished from U+2267 -   ; the same distinction applies to U+2264 -    and U+2266 - 

300 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.4 Mathematical Operators

 . This exception to the general rule regarding variation results from requirements for character mapping to some Asian standards that distinguish the two forms. Greek-Derived Symbols. Several mathematical operators derived from Greek characters have been given separate encodings to match usage in existing standards. These operators may occasionally occur in context with Greek-letter variables. They include U+2206 ∆  , U+220F ‚ - , and U+2211 ∑ - . Other duplicated Greek characters are those for U+00B5 µ   in the Latin-1 Sup- plement block, U+2126 Ω   in Letterlike Symbols, and several characters among the APL functional symbols in the block. All other Greek charac- ters with special mathematical semantics are found in the Greek block because duplicates were not required for compatibility. N-ary Operators. N-ary operators are distinguished from binary operators by their larger size and the fact that in mathematical layout, they take limit expressions. Miscellaneous Symbols. U+2212 –   is a mathematical operator, to be distin- guished from the ASCII-derived U+002D - -, which may look the same as a minus sign, or may be shorter in length. (For a complete list of dashes in the Unicode Stan- dard, see Table 6-2 .) U+22EE..U+22F1 are a set of ellipses used in matrix notation. Mathematical Property. A list of characters with the mathematical property is provided in Section 4.9, Mathematical Property.

Arrows: U+2190–U+21FF Arrows are used for a variety of purposes: to imply directional relation, to show logical der- ivation or implication, or to represent the cursor control keys. The Unicode Standard attempts to provide fairly complete encodings for generic arrow shapes, especially where there are established usages with well-defined semantics. It does not attempt to encode every possible stylistic variant of arrows separately, especially where their use is mainly decorative. For most arrow variants, the Unicode Standard provides encodings in the two horizontal directions, often in the four cardinal directions. For the single and double arrows, the Unicode Standard provides encodings in eight directions. Standards. The Unicode Standard encodes arrows from many different national standards and corporate collections. Unifications. Arrows expressing mathematical relations have been encoded in the arrows block. An example is U+21D2 ⇒   , which may be the equiva- lent of implies. Long and short arrow forms encoded in glyph standards or typesetting systems such as TEX are not represented by separate Unicode values. Encoding Principles. Because the arrows have such a wide variety of applications, there may be several semantic values for the same Unicode character value. For example, U+21B5 ↵      may be the equivalent of car- riage return; U+2191 ↑   may be the equivalent of increases or exponent.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 301 12.5 Technical Symbols Symbols

12.5 Technical Symbols

Control Pictures: U+2400–U+243F The need to show the presence of the C0 control codes and the  unequivocally when data are displayed has led to conventional representations for these nongraphic characters. By definition, control codes themselves are manifested only by their action. However, it is sometimes necessary to show the position of a control code within a data stream. Conven- tional illustrations for the ASCII C0 control codes have been developed. By definition, the  is a blank graphic. Conventions have also been established for the visible representation of the space. Standards. The CNS 11643 standard encodes characters for pictures of control codes. Standard representations for control characters have been defined—for example, in ANSI X3.32 and ISO 2047—but for the control code graphics U+2400..U+241F only the seman- tic is encoded in the Unicode Standard. This choice allows a particular application to use the graphic representation it prefers. Pictures for ASCII Space. Two specific glyphs are provided that may be used to represent the ASCII space character (U+2420 and U+2422). Code Points for Pictures for Control Codes. The remaining code points in this block are not associated with specific glyphs, but rather are available to encode any desired pictorial representation of the given control code. The assumption is that the particular pictures used to represent control codes are often specific to different systems, and are not often the subject of text interchange between systems.

Miscellaneous Technical: U+2300–U+23FF This block encodes technical symbols, including keytop labels such as U+232B    . Excluded from consideration were symbols that are not normally used in one- dimensional text but are intended for two-dimensional diagrammatic use, such as symbols for electronic circuits. An unusually large expansion space is provided because it is antici- pated that a large number of technical symbols could eventually be considered for addition to the Unicode Standard. Crops and Quine Corners. Crops and quine corners are most properly used in two-dimen- sional layout but may be referred to in plain text. The usage of crops and quine corners is as indicated in the following diagram:

Use of crops Use of quine corners APL Functional Symbols. APL (A Programming Language) makes extensive use of func- tional symbols constructed by composition with other, more primitive functional symbols. It made extensive use of and overstrike mechanisms in early computer imple- mentations. In principle, functional composition is productive in APL; in practice,

302 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.5 Technical Symbols however, a relatively small number of composed functional symbols have become standard operators in APL. This relatively small set is encoded in entirety in this block. All other APL extensions can be encoded by composition of other Unicode characters. For example, the APL symbol a underbar can be represented by U+0061     + U+0332   .

Optical Character Recognition: U+2440–U+245F This block includes those symbolic characters of the OCR-A character set that do not cor- respond to ASCII characters, and magnetic ink character recognition (MICR) symbols used in check processing. Standards. Both sets of symbols are specified in ISO 2033.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 303 12.6 Geometrical Symbols Symbols

12.6 Geometrical Symbols

Box Drawing: U+2500–U+257F The characters in the Box Drawing block are encoded solely to facilitate the support of leg- acy implementations, such as terminal emulation. Standards. GB 2312, KS C 5601, and industry standards were used to develop this block.

Block Elements: U+2580–U+259F The block represents a graphic compatibility zone in the Unicode Stan- dard. A number of existing national and vendor standards, including IBM PC , contain a number of characters intended to enable a simple kind of display cell graph- ics by filling some fraction of each cell, or by filling each display cell by some degree of shading. The Unicode Standard does not encourage this kind of character-based graphics model but includes a minimal set of such characters for backward compatibility with the existing standards. Half-block fill characters are included for each half of a display cell, plus a graduated series of vertical and horizontal fractional fills based on one-eighth parts. Also included is a series of shades based on one-quarter shadings. The fractional fills do not form a logically com- plete set but are intended only for backward compatibility.

Geometric Shapes: U+25A0–U+25FF The Geometric Shapes are a collection of characters intended to encode prototypes for var- ious commonly used geometrical shapes—mostly squares, triangles, and circles. The col- lection is somewhat arbitrary in scope; it is a compendium of shapes from various character and glyph standards. The typical distinctions more systematically encoded include black versus white, large versus small, basic shape (square versus triangle versus cir- cle), orientation, and top versus bottom or left versus right part. The hatched and cross-hatched squares at U+25A4..U+25A9 are derived from the Korean national standard (KS C 5601), in which they were probably intended as representations of fill patterns. Because the semantics of those characters is insufficiently defined in that stan- dard, the Unicode character encoding simply carries the glyphs themselves as geometric shapes to provide a mapping for the Korean standard. U+25CA ¯  is a typographical symbol seen in PostScript and in the Macintosh character set. It should be distinguished from both the generic U+25C7   and the U+2662   , as well as from another character sometimes called a lozenge, U+2311  . The squares and triangles at U+25E7..U+25EE are derived from the Linotype font collec- tion. U+25EF   is included for compatibility with the JIS X 0208-1990 Japa- nese standard. Standards. The Geometric Shapes are derived from a large range of national and vendor character standards.

304 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.7 Miscellaneous Symbols and Dingbats

12.7 Miscellaneous Symbols and Dingbats

Miscellaneous Symbols: U+2600–U+26FF The Miscellaneous Symbols block consists of a very heterogeneous collection of symbols that do not fit in any other Unicode character block and that tend to be rather pictographic in nature. These symbols are typically used for text decorations, but they may also be treated as normal text characters in applications such as typesetting chess books, card game manuals, and horoscopes. Characters in the Miscellaneous Symbols block may be rendered in more than one way, unlike characters in the Dingbats block, in which characters correspond to an explicit glyph. For example, both U+2641  and U+2645  have common alternative glyphs.  can be rendered as >or ?, and  can be rendered as < or = The order of the Miscellaneous Symbols is completely arbitrary, but an attempt has been made to keep like symbols together and to group subsets of them into meaningful orders. Some of these subsets include weather and astronomical symbols, pointing hands, religious and ideological symbols, the I Ching trigrams, planet and zodiacal symbols, chess pieces, card suits, and musical dingbats. (For other moon phases, see the circle-based shapes in the Geometric Shapes block.) Corporate logos and collections of pictures of animals, vehicles, foods, and so on are not included because they tend either to be very specific in usage (logos, political party sym- bols) or nonconventional in appearance and semantic interpretation (pictures of cows or of cats; fizzing champagne bottles), and hence are inappropriate for encoding as characters. The Unicode Standard recommends that such items be incorporated in text via higher pro- tocols that allow intermixing of graphic images with text, rather than by indefinite exten- sion of the number of Miscellaneous Symbols encoded as characters. However, a large unassigned space has been set aside in this block with the expectation that other conven- tional sets of such symbols may be found appropriate for character encoding in the future. Standards. The Miscellaneous Symbols are derived from a large range of national and ven- dor character standards.

Dingbats: U+2700–U+27BF The Dingbats are a well-established set of symbols comprising the industry standard “Zapf Dingbat” font—currently available in most laser printers. Other series of dingbats also exist but are not encoded in the Unicode Standard because they are not widely implemented in existing hardware and software as character-encoded fonts. Dingbats that are part of other standards have been encoded in the Geometrical Shapes, , and Miscellaneous Symbols blocks. The order of the remaining dingbats follows the PostScript encoding. The treatment of the Dingbats in the Unicode Standard differs from that of all other char- acters. These symbols are encoded as specific glyph shapes, rather than as glyphic arche- types for abstract characters that can be represented in different faces and styles. Thus, it would be incorrect to arbitrarily replace U+279D „ -   with any other right arrow dingbat or with any of the generic arrows from the Arrows block (U+2190..U+21FF). In other words, because the Zapf Dingbat font refers to glyphs from a specific typeface, their semantic value is their shape.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 305 12.7 Miscellaneous Symbols and Dingbats Symbols

Unifications. A number of the Dingbats represent shapes that overlap with regular Uni- code symbol characters. Instead of coding both a Zapf Dingbat glyph shape and a separate character whose glyphic representation is normally indistinguishable from that shape, the Unicode Standard unifies the two. The characters in question include card suits,  ,  , and  -  (see “Miscellaneous Sym- bols”);   and   (see “Geometric Shapes”); white encircled num- bers 1 to 10 (see “Enclosed Alphanumerics”); and several generic arrows (see “Arrows”). These four entries appear elsewhere in this section. The positions of these unified characters are left unassigned in the Dingbats block and are cross-referenced to the assigned positions in the other blocks. Applications may use alter- native glyphs for representing those characters (as for any normal Unicode characters), including, of course, the exact shapes required for rendering them in the Zapf Dingbat font on an imaging device. To illustrate this distinction, an application encoding an encircled digit one with U+2460 Å    may render that encircled digit in any appropriate typeface—serif or sans serif, roman or italic, and with the circle rendered in different thicknesses. On the other hand, an application encoding an encircled digit one with the Dingbat U+2780 þ      requires an explicit sans serif glyph from the Zapf Dingbat font for rendering.

306 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.8 Enclosed and Square

12.8 Enclosed and Square

Enclosed Alphanumerics: U+2460–U+24FF The enclosed numbers and Latin letters of this block come from several sources, chiefly East Asian standards, and are provided for compatibility with them. Standards. Enclosed letters and numbers occur in the Korean national standard, KS C 5601, and in the Chinese national standard, GB 2312, as well as in various East Asian indus- try standards. The Zapf Dingbat character set in widespread industry use contains four sets of encircled numbers (including encircled zero). The black-on-white set that has numbers with serifs is encoded here (U+2460..U+2468, and U+24EA). The other three sets are encoded in the range U+2776..U+2793 in the Dingbats block. Decompositions. The parenthesized letters or numbers may be decomposed to a sequence of opening parenthesis, letter or digit(s), closing parenthesis. The numbers with a period may be decomposed to digit(s), followed by a period. The encircled letters and single-digit numbers may be decomposed to a letter or digit followed by U+20DD  -  . Decompositions for the encircled numbers 10 through 20 are not supported in Unicode plain text. (For more information, see Chapter 2, General Structure, and Chapter 3, Conformance.)

Enclosed CJK Letters and Months: U+3200–U+32FF Standards. This block provides mapping for all the enclosed Hangul elements from Korean standard KS C 5601 as well as parenthesized ideographic characters from JIS X 0208-1990 standard, CNS 11643, and several corporate registries.

CJK Compatibility: U+3300–U+33FF CJK squared Katakana words are Katakana-spelled words that fill a single display cell (em- square) when intermixed with CJK ideographs. Likewise, squared Latin abbreviation sym- bols are designed to fill a single character position when mixed with CJK ideographs. These characters are provided solely for compatibility with existing character encoding standards. Modern software can supply an infinite repertoire of Kana-spelled words or squared abbreviations on the fly. Standards. CJK Compatibility characters are derived from the KS C 5601 and CNS 11643 national standards, and from various company registries. Japanese Era Names. The Japanese era names refer to the dates given in Table 12-2. Table 12-2. Japanese Era Names U+337B     1989-01-07 to present day U+337C     1926-12-24 to 1989-01-06 U+337D     1912-07-29 to 1926-12-23 U+337E     1867 to 1912-07-28

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 307 12.9 Braille Symbols

12.9 Braille

Braille: U+2800–U+28FF Braille is a writing system used by blind people worldwide. It uses a system of six or eight raised dots, arranged in two vertical rows of three or four dots respectively. Eight-dot sys- tems build on six-dot systems by adding two extra dots above or below the core matrix. Six- dot Braille allows 64 possible combinations, and eight-dot Braille allows 256 possible pat- terns of dot combinations. There is no fixed correspondence between a dot pattern and a character or symbol of any given script. Dot pattern assignments are dependent on context and user community. A single pattern can represent an abbreviation or a frequently occur- ring short word. For a number of contexts and user communities, the series of interna- tional standards starting with ISO 11548-1 provide standardized correspondence tables as well as invocation sequences to indicate a context switch. The Unicode Standard encodes a single complete set of 256 eight-dot patterns. This set includes the 64 dot patterns needed for six-dot Braille. The character names for Braille patterns are based on the assignments of the dots of the Braille pattern to digits 1 to 8 as follows:

1 ll 4 2 ll 5 3 ll 6 7 ll 8

The designation of dots 1 to 6 corresponds to that of 6-dot Braille. The additional dots 7 and 8 are added beneath. The character name for a Braille pattern consists of  -  -, where only those digits corresponding to dots in the pattern are included. The name for the empty pattern is   . The 256 Braille patterns are arranged in the same sequence as in ISO 11548-1, which is based on an number generated from the pattern arrangement. Octal numbers are associated with each dot of a Braille pattern in the following way:

1 ll 10 2 ll 20 4 ll 40 100 ll 200

The octal number is obtained by adding the values corresponding to the dots present in the pattern. Octal numbers smaller than 100 are expanded to three digits by inserting leading zeroes. For example, the dots of   - are assigned to the octal val- ues of 18, 28, 108, and 1008. The octal number representing the sum of these values is 1138. The assignment of meanings to Braille patterns is outside the scope of this standard. Example. According to ISO 11548-2, the character     can be repre- sented in eight-dot Braille by the combination of the dots 1, 2, 4, and 7 ( 

308 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Symbols 12.9 Braille

-). A full circle corresponds to a tangible (set) dot, and empty circles serve as posi- tion indicators for dots not set within the dot matrix:

1 ll 4 2 l ° 5 3 ° ° 6 7 l ° 8

Usage Model. The eight-dot Braille patterns in the Unicode Standard are intended to be used with either style of eight-dot Braille system, whether the additional two dots are con- sidered to be in the top row or in the bottom row. These two systems are never intermixed in the same context, so their distinction is a matter of convention. The intent of encoding the 256 Braille patterns in the Unicode Standard is to allow input and output devices to be implemented that can interchange Braille data without having to go through a context- dependent conversion from semantic values to patterns, or vice versa. In this manner, final form documents can be exchanged and faithfully rendered. On the other hand, processing of textual data that require semantic support is intended to take place using the regular character assignments in the Unicode Standard. Imaging. When output on a Braille device, dots shown as black are intended to be rendered as tangible. Dots shown in the standard as open circles are blank (not rendered as tangible). The Unicode Standard does not specify any physical dimension of Braille characters. In the absence of a higher-level protocol, Braille patterns are output from left to right. When used to render final form (tangible) documents, Braille patterns are normally not intermixed with any other Unicode characters except control codes.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 309 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 13 Special Areas and Format Characters 13

This chapter describes several kinds of characters that have special properties as well as areas of codespace that are set aside for special purposes: • Control Codes •Layout Controls • Deprecated Format Characters •Surrogates • Private Use Area • Special Characters In addition to regular characters, the Unicode Standard contains a number of characters that are not normally rendered directly, but that influence the layout of text or otherwise affect the operation of text processes. They are called format characters. The Unicode Standard contains code positions for the 64 control characters and the DEL character found in ISO standards and many vendor character sets. The choice of control function associated with a given character code is outside the scope of the Unicode Stan- dard, with the exception of those control characters specified in this chapter. Layout controls are not themselves rendered visibly, but influence the behavior of algo- rithms for line breaking, word breaking, glyph selection, and bidirectional ordering. Surrogate characters are reserved and are to be used in pairs to access 1,048,544 characters. Private use characters are reserved for private use. Their meaning is defined by private agreement. The Specials block contains characters that are neither graphic characters nor traditional controls.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 313 13.1 Control Codes Special Areas and Format Characters

13.1 Control Codes

C0 Control Codes: U+0000–U+001F ASCII C0 Control Codes and Delete. The Unicode Standard makes no specific use of these control codes, but it provides for the passage of the numeric code values intact, neither adding to nor subtracting from their semantics. The semantics of the C0 controls (U+0000..U+001F) and delete (U+007F) are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the semantics specified in ISO/IEC 6429. U+0009   is widely used with a consistent set of semantics; therefore these semantics are recog- nized in the Unicode Standard. (For more information on control codes, see Section 2.8, Controls and Control Sequences.) There is a simple one-to-one mapping between 7-bit (and 8-bit) control codes and Uni- code control codes: every 7-bit (or 8-bit) control code is simply zero-extended to a 16-bit code. For example, if   (0A) is to be used for terminal control, then the text “WXYZ” would be transmitted in plain Unicode text as the following 16-bit values: “0057 0058 000A 0059 005A.” Any interpretation of these control codes is outside the scope of the Unicode Standard; programmers should refer to a relevant standard (for example, ISO/IEC 6429) that specifies control code interpretations. Newline Function. One or more of the control codes U+000A  , U+000D -  , or the Unicode Equivalent of EBCDIC NL encode a newline function. A newline function can act like a line separator or a paragraph separator, depending on the application. See Section 13.2, Layout Controls, for information on how to interpret a line or paragraph separator. The exact encoding of a newline function depends on the application domain. For information on how to identify a newline function, see Unicode Technical Report #13, “Unicode Newline Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.

C1 Control Codes: U+0080–U+009F In extending the 7-bit encoding system of ASCII to an 8-bit system, ISO/IEC 4873 (on which the 8859 family of character standards are based) introduced 32 additional control codes in the range 80–9F hex. Like the C0 control codes, the Unicode Standard makes no specific use of these C1 control codes, but provides for the passage of their numeric code values intact, neither adding to nor subtracting from their semantics. The semantics of the C1 controls (U+0080..U+009F) are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the semantics specified in ISO/IEC 6429.

314 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.2 Layout Controls

13.2 Layout Controls

Layout Controls The effect of layout controls is specific to particular text processes. As much as possible, lay- out controls are transparent to those text processes for which they were not intended. In other words, their effects are mutually orthogonal.

Line and Word Breaking U+00A0 -  has the same width as U+0020 , but the -  indicates that, under normal circumstances, no line-breaks are permitted between it and surrounding characters, unless the preceding or following character is a line or paragraph separator. U+00A0 -  behaves like the following coded character sequence: U+FEFF   -  + U+0020  + U+FEFF   -  . For a complete list of space characters in the Unicode Standard, see Table 6-1. U+00AD   indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line-break occurs may differ (for example, in some scripts it is rendered as a hyphen -, while in others it may be invisible). Contrast this usage with U+2027 -  , which is used for a visible indication of the place of hyphenation in dictio- naries. For a complete list of dash characters, including all the hyphens, in the Unicode Standard, see Table 6-2. There are two nonbreaking hyphen characters in the Unicode Standard: U+2011 -   and U+0F0C     . See Section 9.13, Tibetan, for more discussion of the Tibetan-specific line-breaking behavior. Zero Width No-Break Space. In addition to the meaning of byte order mark, the code value U+FEFF possesses the semantics of   - . As   - , U+FEFF behaves like U+00A0 -  in that it indicates the absence of word boundaries; however, the former has no width. For example, this character can be inserted after the fourth character in the text “base+delta” to indicate that there should be no line-break between the “e” and the “+”. The   -  can be used to prevent line breaking with other characters that do not have non- breaking variants, such as U+2009   or U+2015  , by bracketing the character. Zero Width Space. The U+200B    indicates a word boundary, except that it has no width. Zero-width space characters are intended to be used in languages that have no visible word spacing to represent word breaks, such as in Thai or Japanese. When text is justified, ZWSP has no effect on letter spacing—for example, in English or Japanese usage. There may be circumstances with other scripts, such as Thai, where extra space is applied around ZWSP as a result of justification, as shown in Figure 13-1. This approach is unlike the use of fixed-width space characters, such as U+2002  , that have specified width

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 315 13.2 Layout Controls Special Areas and Format Characters and should not be automatically expanded during justification (see Section 6.1, General Punctuation). Figure 13-1. Letter Spacing

Type Justification Examples Explanation

Memory the ISP¨‹Charts The ‹ is inserted to allow linebreak after ¨ Display 1 the ISP¨Charts Without letter spacing

Display 2 the ISP¨ Charts Normal letter spacing

Display 3 the ISP¨ Charts “Thai-style” letter spacing

Display 4 the ISP¨ Charts ‹ incorrectly inhibiting letter spacing

Zero-Width Spaces and Joiner Characters. The zero-width spaces are not to be confused with zero-width joiner characters. U+200C   - and U+200D    have no effect on word boundaries, and   -  and    have no effect on joining or linking behavior. In other words, the zero- width joiner characters should be ignored when determining word boundaries;    should be ignored when determining cursive joining behavior. See Cursive Connection below. Line and Paragraph Separator. The Unicode Standard provides two unambiguous charac- ters, U+2028   and U+2029  , to separate lines and paragraphs. They are considered the canonical form of denoting line and paragraph boundaries in Unicode plain text. A new line is begun after each  . A new paragraph is begun after each  . As these characters are separator codes, it is not necessary either to start the first line or paragraph or to end the last line or paragraph with them. Doing so would indicate that there was an empty paragraph or line following. The   can be inserted between paragraphs of text. Its use allows the creation of plain text files, which can be laid out on a different line width at the receiving end. The   can be used to indicate an unconditional end of line. A paragraph separator indicates where a new paragraph should start. Any interparagraph formatting would be applied. This formatting could cause, for example, the line to be bro- ken, any interparagraph line spacing to be applied, and the first line to be indented. A line separator indicates that a line-break should occur at this point; although the text continues on the next line, it does not start a new paragraph: no interparagraph line spacing or para- graphic indentation is applied.

Cursive Connection The Non-joiner and Joiner. In some fonts for some scripts, consecutive characters in a text stream may be rendered via adjacent glyphs that cursively join to each other, so as to emu- late connected handwriting. For example, cursive joining is implemented in nearly all fonts for the Arabic scripts and in a few handwriting-like fonts for the Latin script. Cursive rendering is implemented by joining glyphs in the font, plus using a process that selects the particular joining glyph to represent each individual character occurrence, based on the joining nature of its neighboring characters. This glyph selection is implemented in some combination between the rendering engine and the font itself.

316 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.2 Layout Controls

In cases where cursive joining is implemented, on occasion an author may wish to override the normal automatic selection of joining glyphs. Typically, this choice is made to achieve one of the following effects: • Cause nondefault joining appearance (for example, as is sometimes required in writing Persian using the Arabic script). • Exhibit the joining-variant glyphs themselves in isolation. The Unicode Standard provides a means to influence joining glyph selection, by means of the two characters U+200C   - and U+200D   . Logically, these characters do not modify the contextual selection process itself, but rather they change the context of a particular character occurrence. By providing a non-joining neighbor character where otherwise the neighbor would be joining, or vice versa, they deceive the rendering process into selecting a different joining glyph. This process can be used in two ways: 1. Prevent joining appearance. For example, š U+0635    ZW   - NJ U+200C © U+0644    would be rendered as ©š (that is, the normal cursive joining of the interior sad and lam is overridden). Without the   -, it would be rendered as Ã

2. Exhibit joining glyphs in isolation. For example, ZW    J U+200D Ÿ U+063A    ZW    J U+200D

would be rendered as µ (that is, the medial glyph form of the ghain appears in isola- tion). Without the    before and after, it would be rendered as Ÿ

The preceding examples are adapted from the Iranian national coded character set stan- dard, ISIRI 3342, which defines these characters as “pseudo space” and “pseudo connec- tion,” respectively. The    does have specific interpretations in certain scripts as specified in this standard. For example, in Indic scripts it provides an invisible neighbor to which a dead consonant may join to induce a half-consonant form (see Section 9.1, Devanagari). It is not to be used for arbitrary “glueing” of textual elements together, such as for ligatures or marking index items.   - or    are format control characters. As with other such characters, they should be ignored by processes that analyze text content. For example, a spelling-checker or find/replace operation should filter them out. (See Section 2.7, Special Character and Noncharacter Values, for a general discussion of format control characters.) The effect of these characters in display depends on the context in which they are found. Adding a    between characters that are already cursively connected will

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 317 13.2 Layout Controls Special Areas and Format Characters have no effect. Adding a   - between characters that are unconnected will also have no effect. For example, any number of   - or    characters sprinkled into an English text stream will have no effect on its appearance when rendered in a typical noncursive Latin font.

Controlling Ligatures Although    and   - should not affect ligating behavior, in some systems they may break up ligatures by interrupting the character sequence required to form the ligature. For example, a cursive Latin font would produce the results shown in Figure 13-2. Figure 13-2. Ligature Example Memory Representation Rendering f i s h f- -i- -s- -h fi- -s- -h optionally using a ligature ZW f J i s h f- -i- -s- -h fi- -s- -h optionally using a ligature ZW f NJ i s h f i- -s- -h

ZW ZW f J NJ i s h f- i- -s- -h

ZW ZW f NJ J i s h f -i- -s- -h

Usage of optional ligatures such as fi is not currently controlled by any codes within the Unicode Standard but is determined by protocols or resources external to the text sequence where hyphens indicate cursive joining.

Bidirectional Ordering Codes These codes are used in the bidirectional algorithm, described in Chapter 3, Conformance. Systems that handle bidirectional scripts (Arabic and Hebrew) should be sensitive to these codes. The codes appear in Table 13-1. Table 13-1. Bidirectional Ordering Codes

Code Name Abbrev. U+200E --   U+200F --   U+202A --   U+202B --   U+202C     U+202D --   U+202E --  

As with the other zero-width character codes, except for their effect on the layout of the text in which they are contained, the bidirectional ordering characters can be ignored by the processing software. For nonlayout text processing, such as sorting, searching, and so on, the zero-width layout characters may be ignored. However, operations that modify text must maintain these characters correctly, because the matching pairs of zero-width format- ting characters must be coordinated (see Chapter 3, Conformance).

318 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.2 Layout Controls

U+200E --  and U+200F --  have the semantics of an invisible character of zero width, except that these characters have strong directionality. They are intended to be used to resolve cases of ambiguous directionality in the context of bidirectional texts. Unlike U+200B   , these characters carry no word- break semantics. (See Section 3.12, Bidirectional Behavior, for more information.)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 319 13.3 Deprecated Format Characters Special Areas and Format Characters

13.3 Deprecated Format Characters

Deprecated Format Characters: U+206A–U+206F Three pairs of deprecated format characters are encoded in this block: • Symmetric swapping format characters used to control the glyphs that depict characters such as “(”. (The default state is activated.) • Character shaping selectors used to control the shaping behavior of the Arabic compatibility characters. (The default state is inhibited.) • Numeric shape selectors used to override the normal shapes of the Western Digits. (The default state is nominal.) The use of these character shaping selectors and codes for digit shapes is strongly discour- aged in the Unicode Standard. Instead, the appropriate character codes should be used with the default state. For example, if contextual forms for Arabic characters are desired, then the nominal characters should be used, and not the presentation forms with the shaping selectors. Similarly, if the Arabic digit forms are desired, then the explicit characters should be used, such as U+0660 -  . Symmetric Swapping. The symmetric swapping format characters are used in conjunction with the class of left- and right-handed pairs of characters (symmetric characters), such as parentheses. The characters thus affected are listed in Section 4.7, Mirrored—Normative. They indicate whether the interpretation of the term  or  in the character names should be interpreted as meaning opening or closing, respectively. They do not nest. The default state of symmetric swapping may be set by a higher-level protocol or standard, such as ISO 6429. In the absence of such a protocol, the default state is activated. From the point of encountering U+206A    format character up to a subsequent U+206B    (if any), the symmetric char- acters will be interpreted and rendered as left and right. From the point of encountering U+206B    format character up to a subsequent U+206A    (if any), the symmetric charac- ters will be interpreted and rendered as opening and closing. This state (activated) is the default state in the absence of any symmetric swapping code or a higher-level protocol. Character Shaping Selectors. The character shaping selector format characters are used in conjunction with Arabic presentation forms. During the presentation process, certain let- terforms may be joined together in cursive connection or ligatures. The shaping selector codes indicate that the character shape determination (glyph selection) process used to achieve this presentation effect is to be either activated or inhibited. The shaping selector codes do not nest. From the point of encountering a U+206C     format charac- ter up to a subsequent U+206D     (if any), the character shaping determination process should be inhibited. If the backing store contains Arabic presentation forms (for example, U+FE80..U+FEFC), then these forms should be pre- sented without shape modification. This state (inhibited) is the default state in the absence of any character shaping selector or a higher-level protocol. From the point of encountering a U+206D     format char- acter up to a subsequent U+206C     (if any), any Arabic

320 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.3 Deprecated Format Characters presentation forms that appear in the backing store should be presented with shape modi- fication by means of the character shaping (glyph selection) process. The shaping selectors have no effect on nominal Arabic characters (U+0660..U+06FF), which are always subject to character shaping (glyph selection) and which are unaffected by these formatting codes. Numeric Shape Selectors. The numeric shape selector format characters allow the selec- tion of the shapes in which the digits U+0030..U+0039 are to be rendered. These format characters do not nest. From the point of encountering a U+206E    format character up to a subsequent U+206F    (if any), the European digits (U+0030.. U+0039) should be depicted using the appropriate national digit shapes as specified by means of appropriate agreements. For example, they could be displayed with shapes such as the -  (U+0660..U+0669). The actual character shapes (glyphs) used to display national digit shapes are not specified by the Unicode Standard. From the point of encountering a U+206F    format character up to a subsequent U+206E    (if any), the European digits (U+0030.. U+0039) should be depicted using glyphs that represent the nominal digit shapes shown in the code tables for these digits. This state (nominal) is the default state in the absence of any numeric shape selector or a higher-level protocol.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 321 13.4 Surrogates Area Special Areas and Format Characters

13.4 Surrogates Area

Surrogates Area: U+D800–U+DFFF The Surrogates Area consists of 1,024 low-half surrogate code values and 1,024 high-half surrogate code values, which are interpreted in pairs to access more than 1 million code points. Except for private use, there are no such characters currently assigned in this version of this standard. For the formal definition of a surrogate pair and the role of surrogate pairs in the Unicode Conformance Clause, see Section 3.7, Surrogates, and Section 5.4, Handling Surrogate Pairs. The use of surrogate pairs in the Unicode Standard is formally equivalent to the Universal Transformation Format-16 (UTF-16) defined in ISO 10646. (For a complete statement of the UTF-16 extension mechanism, see Appendix C, Relationship to ISO/IEC 10646.) High-Surrogate. The high-surrogate code values are assigned to the range U+D800.. U+DBFF. The high-surrogate code value is always the first element of a surrogate pair. Low-Surrogate. The low-surrogate code values are assigned to the range U+DC00.. U+DFFF. The low-surrogate code value is always the second element of a surrogate pair. Private-Use High-Surrogates. The high-surrogate code values from U+DB80..U+DBFF are private-use high-surrogate code values (a total of 128 code values). Characters repre- sented by means of a surrogate pair, where the high-surrogate code value is a private-use high-surrogate, are private-use characters. This mechanism allows for a total of 131,068 (= 128 × 1024 – 4) private-use characters representable by means of surrogate pairs. (For more information on private-use characters, see the discussion of the Private Use Area.) The code tables do not have charts or name list entries for the range D800..DFFF because individual, unpaired surrogates merely have code values.

322 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.5 Private Use Area

13.5 Private Use Area

Private Use Area: U+E000–U+F8FF The Private Use Area is reserved for use by software developers and end users who need a special set of characters for their application programs. The code points in this area are reserved for private use and do not have defined, interpretable semantics except by private agreement. There are no charts for this area, as any character encoded in this area is privately defined. There are also private-use characters defined by means of surrogate pairs. (See the Surro- gates Area description for a specification of how those private-use characters are encoded.) Encoding Structure. By convention, the Private Use Area is divided into a Corporate Use subarea, starting at U+F8FF and extending downward in values, and an End User subarea, starting at U+E000 and extending upward. Corporate Use Subarea. Systems vendors and/or software developers may need to reserve some private-use characters for internal use by their software. The Corporate Use subarea is the preferred area for such reservations. Assignments of character semantics in this sub- area could be completely internal, hidden from the end users, and used only for vendor- specific application support, or they could be published as vendor-specific character assignments available to applications and end users. An example of the former case would be the assignment of a character code to a system support operation such as or ; an example of the latter case would be the assignment of a character code to a vendor-specific logo character such as Apple’s apple character. End User Subarea. The End User subarea is intended for private-use character definitions by end users or for scratch allocations of character space by end-user applications. Allocation of Subareas. Vendors may choose to reserve private-use codes in the Corporate Use subarea and make some defined portion of the End User subarea available for com- pletely free end-user definition. This convention is for the convenience of system vendors and software developers. No firm dividing line between the two subareas is defined, as dif- ferent users may have different requirements. No provision is made in the Unicode Stan- dard for avoiding a “stack-heap collision” between the two subareas in the Private Use Area. Promotion of Private-Use Characters. In future versions of the Unicode Standard, some characters that have been defined by one vendor or another in the Corporate Use subarea may be encoded elsewhere as regular Unicode characters if their usage is widespread enough that they become candidates for general use. The code positions in the Private Use Area are permanently reserved for private use—no assignment to a particular set of charac- ters will ever be endorsed by the Unicode Consortium.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 323 13.6 Specials Special Areas and Format Characters

13.6 Specials

Specials: U+FEFF, U+FFF0–U+FFFF The Specials block contains code values that are interpreted neither as control nor graphic characters but that are provided to facilitate current software practices. The 14 Unicode values from U+FFF0..U+FFFD are reserved for special character definitions. In addition to these 14 positions, 2 code values are specified here for use not as characters but as special signaling devices (described below).

Byte Order Mark (BOM) There are two circumstances where the character U+FEFF can have a special interpretation that is different from the normal semantics of a zero width no-break space (see Section 13.2, Layout Controls): 1. Unmarked Byte Order. Some machine architectures use the so-called big- endian byte order, while others use the little-endian byte order. When Unicode text is serialized into bytes, the bytes can go in either order, depending on the architecture. However, sometimes this byte order is not externally marked, which causes problems in interchange between different systems. 2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only infor- mation available is that the stream contains text, but the precise character set is not known. In these two cases, the character U+FEFF can be used as a signature to indicate the byte order and the character set by using the UTF-16 serialization described in Section 3.8, Transformations. Because the byte-swapped version U+FFFE is always an illegal Unicode value, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text. In the UTF-16 serialization, U+FEFF at the very beginning of a file or stream explicitly sig- nals the byte order.

The byte sequence FE16 FF16 may serve as a signature to identify a file as containing Uni- code text. This sequence is exceedingly rare at the outset of text files using other character encodings, whether single- or multiple-byte, and therefore not likely to be confused with real text data. For example, in systems that employ ISO Latin 1 (ISO/IEC 8859-1) or the ANSI Code Page 1252, the byte sequence FE16 FF16 constitutes the string , y diaeresis “þÿ”; in systems that employ the Apple Macintosh Roman character set or the Adobe Standard Encoding, this sequence represents the sequence ogonek, hacek “À¬”; in systems that employ other common IBM PC Code Pages (for example, CP 437, 850), this sequence represents black square, no-break space “ ”.

In UTF-8, the BOM corresponds to the byte sequence EF16 BB16 BF16. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings. For example, in systems that employ Microsoft Windows ANSI Code Page 1252, EF16 FF16 BF16 corresponds to the sequence i diaeresis, guillemet, inverted question mark “ï » ¿”.

324 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.6 Specials

Where the character set information is explicitly marked, such as in UTF-16BE or UTF- 16LE, then all U+FEFF characters, even at the very beginning of the text, are to be inter- preted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are also not required and are to be interpreted as zero width no- break spaces. For example, for strings in an API, the memory architecture of the processor provides the explicit byte order. For databases and similar structures, it is much more effi- cient and robust to use a uniform byte order for the same field (if not the entire database), thereby avoiding use of the byte order mark. Systems that use the byte order mark must recognize that an initial U+FEFF signals the byte order; it is not part of the textual content. It should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. To represent an - tial U+FEFF   -  in a UTF-16 file, use U+FEFF twice in a row. The first one is a byte order mark; the second one is the initial zero width no-break space. See Table 13-2 for a summary of encoding form signatures. Table 13-2. Unicode Encoding Form Signatures Encoding Form Signature UTF-8 EF BB BF UTF-16 Big-Endian FE FF UTF-16 Little-Endian FF FE UCS-4a Big-Endian 00 00 FE FF UCS-4 Little-Endian FF FE 00 00 a. For UCS-4, see Section C.2, Encoding Forms in ISO/IEC 10646.

Annotation Characters An interlinear annotation consists of annotating text that is related to a sequence of anno- tated characters. For all regular editing and text-processing algorithms, the annotated char- acters are treated as part of the text stream. The annotating text is also part of the content, but for all or some text processing, it does not form part of the main text stream. However, within the annotating text, characters are accessible to the same kind of layout, text-pro- cessing, and editing algorithms as the base text. The annotation characters delimit the annotating and the annotated text, and identify them as part of an annotation. See Figure 13-3. Figure 13-3. Annotation Characters Text display Felix Annotation Annotation characters Text stream characters

Annotated Annotating Annotated Annotating text text text text

The annotation characters are used in internal processing when out-of-band information is associated with a character stream, very similarly to the usage of the U+FFFC   . However, unlike the opaque objects hidden by the latter char- acter, the annotation itself is textual.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 325 13.6 Specials Special Areas and Format Characters

Conformance. A conformant implementation that supports annotation characters inter- prets the base text as if it were part of an unannotated text stream. Within the annotating text, it interprets the annotating characters with their regular Unicode semantics. U+FFF9    is an anchor character, preceding the interlin- ear annotation. The exact nature and formatting of the annotation is dependent on addi- tional information that is not part of the plain text stream. This situation is analogous to that for U+FFFC   . U+FFFA    separates the base characters in the text stream from the annotation characters that follow. The exact interpretation of this charac- ter depends on the nature of the annotation. More than one separator may be present. Additional separators delimit parts of a multipart annotating text. U+FFFB    terminates the annotation object (and returns to the regular text stream). Use in Plain Text. Usage of the annotation characters in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver because the con- tent may be misinterpreted otherwise. Simply filtering out the annotation characters on input will produce an unreadable result or, even worse, an opposite meaning. On input, a plain text receiver should either preserve all characters or remove the interlinear annota- tion characters as well as the annotating text included between the  -   and the   . When an output for plain text usage is desired and when the receiver is unknown to the sender, these interlinear annotation characters should be removed as well as the annotating text included between the    and the   . This restriction does not preclude the use of annotation characters in plain text inter- change, but it requires a prior agreement between the sender and the receiver for correct interpretation of the annotations. Lexical Restrictions. If an implementation encounters a paragraph break between an anchor and its corresponding terminator, it shall terminate any open annotations at this point. Anchor characters must precede their corresponding terminator characters. Unpaired anchors or terminators shall be ignored. A separator occurring outside a pair of delimiters, shall be ignored. Annotations may be nested. Formatting. All formatting information for an annotation is provided by higher-level pro- tocols. The details of the layout of the annotation are implementation-defined. Correct for- matting may require additional information not present in the character stream, but maintained out of band. Therefore, annotation markers serve as placeholders for an imple- mentation that has access to that information from another source. Collation. Except for the special case where the annotation is intended to be used as a sort- key, annotations are typically ignored for collation, or optionally preprocessed to act as breakers only. Importantly, annotation base characters are not ignored, but treated like reg- ular text.

Replacement Characters U+FFFC. The U+FFFC    is used as an insertion point for objects located within a stream of text. All other information about the object is kept out- side the character data stream. Internally it is a dummy character that acts as an anchor point for the object’s formatting information. In addition to assuring correct placement of

326 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Special Areas and Format Characters 13.6 Specials an object in a data stream, the object replacement character allows the use of general stream-based algorithms for any textual aspects of embedded objects. U+FFFD. The U+FFFD   is the general substitute character in the Unicode Standard. That character can be substituted for any “unknown” character in another encoding that cannot be mapped in terms of known Unicode values (see Section 5.3, Unknown and Missing Characters).

Noncharacters U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value. Its occurrence in a stream of Unicode data strongly suggests that the Unicode characters should be byte-swapped before interpretation. U+FFFE should be interpreted only as an incorrectly byte-swapped version of U+FEFF   - , also known as the byte order mark. U+FFFF. The 16-bit unsigned hexadecimal value U+FFFF is not a Unicode character value; it may be used by an application as a error code or other noncharacter value. The specific interpretation of U+FFFF is not defined by the Unicode Standard, so it can be viewed as a kind of private-use noncharacter.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 327 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 14 Code Charts 14

Disclaimer Character images shown in the code charts are not prescriptive. In actual fonts, considerable variations are to be expected.

The code charts that follow present the characters of the Unicode Standard. Characters are organized into related groups called blocks. In the Unicode Standard, character blocks gen- erally contain characters from a single script. In many cases, a script is fully represented in its character block. There are, however, important exceptions, most notably in the area of punctuation characters. A character names list follows each character chart, except for CJK ideographs and Hangul syllables, as discussed in Section 14.2, CJK Unified Ideographs, and Section 14.3, Hangul Syl- lables. The character names list itemizes every character in the block and provides supple- mentary information in many cases. An index to distinctive character names is at the back of this book; a full set of character names (including earlier Version 1.0 names) are in the Unicode Character Database on the CD-ROM.

14.1 Character Names List The following illustration identifies the components of typical entries in the character names list. code image entry 00AE  REGISTERED SIGN = REGISTERED TRADE MARK SIGN (Version 1.0 name) 00AF ¯ MACRON (Unicode name) = overline, APL overbar (alternative names) • this is a spacing character (informative note) → 02C9 ¯ modifier letter macron (cross reference) → 0304 þ öcombining macron → 0305þ úcombining overline ­ 0020 0304 þö (compatibility decomposition)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 331 14.1 Character Names List Code Charts

00E5 å LATIN SMALL LETTER A WITH RING ABOVE • Danish, Norwegian, Swedish, Walloon Æ 0061 a 030A þç (canonical decomposition)

Images in the Code Charts and Character Lists Each character in these code charts is shown with a representative glyph. A representative glyph is not a prescriptive form of the character, but one that enables recognition of the intended character to a knowledgeable user and facilitates lookup of the character in the code charts. In many cases, there are more or less well-established alternative glyphic repre- sentations for the same character. Designers of high-quality fonts will do their own research into the preferred glyphic appearance of Unicode characters. In addition, many scripts require context-dependent glyph shaping, glyph positioning, or ligatures, none of which is shown in the code charts. The representative glyphs in the code charts are based on a serifed, Times-like font. For example, even the ASCII character U+0061     has two common alter- native forms, the “a” used in Times and the “¶” that occurs in many other font styles. In a Times-like font, the character U+03A5     looks like “Y”; the form : is common in other font styles. A different case is U+010F      , which is commonly type- set as ! instead of ". In such cases, the code charts show the more common variant in pref- erence to a more didactic archetypical shape. Many characters have been unified and have different appearances in different language contexts. The shape shown for U+2116 ï   is a fullwidth shape as it would be used in East Asian fonts. In Cyrillic usage,  is the universally recognized glyph. In many cases, characters need to be represented by more or less condensed, shifted, or dis- torted glyphs to make them fit the format of the code charts. For example, U+0D10 ƒ    is shown in a reduced size to fit the character cell. Sometimes characters need to be given artificial shapes to make them recognizable in the code charts. Examples are U+00AD    and U+2011 / - , where the special behavior of the hyphen is indicated by the dashed box and the letters. When characters are used in context, the surrounding text will give important clues as to identity, size, and positioning. In the code charts, these clues are absent. For example, U+2075 Ú  is shown much smaller than it would be in a Times-like text font. Combining characters are shown with a dotted circle—for example, U+0940 0 -    . The relative position of the dotted circle gives an approximate indica- tion of the location of the base character in relation to the combining mark. During rendering, additional adjustments are necessary. Accents such as U+0302  -   are adjusted vertically and horizontally based on the height and width of the base character, as in “Ó” versus “Ù”. For non-European scripts, typical typefaces were selected that allow as much distinction as possible among the different characters. The Unicode Standard contains many characters that are used in writing minority lan- guages or that are historical characters, often used primarily in manuscripts or inscriptions. Where there is no strong tradition of printed materials, the typography of a character may not be settled.

332 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Code Charts 14.1 Character Names List

Cross References Cross-referenced characters (preceded by →) have various characteristics: explicit inequal- ity, the other member of a case pair, or some other linguistic relationship. Explicit . The two characters are not identical, although the glyphs that depict them are identical or very close. 003A : COLON → 0589 : armenian full stop → 2236 : ratio Other Linguistic Relationships. These relationships include transliterations (such as between Serbian and Croatian), typographically unrelated characters used to represent the same sound, and so on. 01C9 lj LATIN SMALL LETTER LJ → 0459 ¾ cyrillic small letter ­ 006C l 006A j

Case Form Mappings When a case mapping corresponds solely to a difference based on  versus  in the names of the characters, the case mapping is not given in the names list but only in the Unicode Character Database on the CD-ROM. 0041 A LATIN CAPITAL LETTER A 01F2 Dz LATIN CAPITAL LETTER D WITH SMALL LETTER Z ­ 0044 D 007A z When the case mapping cannot be predicted from the name, the information is given in a note. 00DF ß LATIN SMALL LETTER SHARP S = Eszett • German • uppercase is “SS” • in origin a ligature of 017F ä and 0073 Ø → 03B2 — greek small letter beta

Decompositions The decomposition sequence (one or more letters) given for a character is either its canon- ical mapping or its compatibility mapping. The canonical mapping is marked with an iden- tical to symbol Æ. 00E5 å LATIN SMALL LETTER A WITH RING ABOVE • Danish, Norwegian, Swedish, Walloon Æ 0061 a 030A þç 212B Å ANGSTROM SIGN Æ 00C5 Å latin capital letter a with ring above Compatibility mappings are marked with an almost equal to symbol ­. Formatting infor- mation may be indicated inside angle brackets.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 333 14.1 Character Names List Code Charts

01F2 Dz LATIN CAPITAL LETTER D WITH SMALL LETTER Z ­ 0044 D 007A z FF21 † FULLWIDTH LATIN CAPITAL LETTER A ­ 0041 A The following compatibility formatting tags are used in the Unicode Character Database:

A font variant (for example, a black letter form) A no-break version of a space, hyphen, or other punctuation An initial presentation form (Arabic) A medial presentation form (Arabic) A final presentation form (Arabic) An isolated presentation form (Arabic) An encircled form A superscript form A subscript form A vertical layout presentation form A wide (or zenkaku) compatibility character A narrow (or hankaku) compatibility character A small variant form (CNS compatibility) A CJK squared font variant A vulgar fraction form Otherwise unspecified compatibility character

In the character names list accompanying the code charts, the “” label is sup- pressed, but all other compatibility formatting tags are explicitly listed in the compatibility mapping. Decompositions are not necessarily full decompositions. For example, the decomposition for U+212B ª   can be further decomposed using the canonical mapping for U+00C5 ª       . (For more information on decomposition, see Section 3.6, Decomposition.)

Information About Languages An informative note may include a list of the language(s) using that character where this information is considered useful. For case pairs, the annotation is given only for the lower- case form, to avoid needless repetition. An ellipsis “...” indicates that the listed languages cited are merely the principal ones among many.

Reserved Characters Character codes that are marked “” are unassigned and reserved for future encoding. Reserved codes are indicated by a ƒ glyph. 060D ƒ Reserved codes may also have cross references to assigned characters located elsewhere. 2073 ƒ → 00B3 3 superscript three

334 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Code Charts 14.2 CJK Unified Ideographs

Character codes that are marked “” are permanently unassigned; they will never be assigned a character. These codes are indicated by a î glyph. FFFF î • the value FFFF î is guaranteed not to be a Unicode charac- ter at all

14.2 CJK Unified Ideographs A character names list is not provided for the CJK Unified Ideographs and CJK Unified Ideo- graphs Vertical Extension A character blocks because the name of a unified ideograph sim- ply consists of its Unicode value preceded by   -. As with other character charts, each Unicode character in these blocks is shown with its Unicode value and a single representative glyph. Note that varying typographic practices throughout East Asia may require glyphs other than the representative one to be used so that the display is correct for a particular country or language. A table providing mappings between the CJK ideographs included in the Unicode Standard and those in other character set standards is included on the CD-ROM. A radical-stroke index to CJK ideographs is in Section 15.1, Han Radical-Stroke Index. An index in Shift-JIS order of the ideographs in JIS X 0208 can be found in Section 15.2, Shift-JIS Index.

14.3 Hangul Syllables A character names list is not provided for characters in the Hangul Syllables Area (U+AC00..U+D7A3) because the name of a Hangul syllable can be determined by algo- rithm as described in Section 3.11, Conjoining Jamo Behavior.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 335 14.3 Hangul Syllables Code Charts The code charts, pages 336-846 in the book, are omitted here. Please see the online code charts at http://www.unicode.org/charts/ Note: The online code charts are continuously updated and may contain characters added after the publication of The Unicode Standard, Version 3.0. To find out whether a particular character was part of the Unicode Standard, Version 3.0, please consult either the printed edition of the standard (ISBN 0-201-61633-5) or version 3.0.0 of the Uni- code Character Database. Normative references to the Unicode Standard, Version 3.0 should use the printed edition.

336 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 15 Han Indices 15

To expedite locating specific Han ideographic characters within the Unicode Han ideo- graphic set, this chapter contains a radical-stroke index and a shift-JIS index.

15.1 Han Radical-Stroke Index Under the traditional radical-stroke system, each Han ideograph is considered to be writ- ten with one of a number of different character elements or radicals and a number of addi- tional strokes. For example, the character # has the radical  and seven additional strokes. To find the character # within a dictionary, one would first locate the section for its radi- cal, , and then find the subsection for characters with seven additional strokes. This method is complicated by the fact that there are occasional ambiguities in the count- ing of strokes. Even worse, some characters are considered by different authorities to be written with different radicals; there is not, in fact, universal agreement about which set of radicals to use for certain characters, particularly with the increased use of simplified characters. The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radi- cals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, how- ever, is in general use, and the 214 KangXi radicals remain the best known. See “CJK and KangXi Radicals” in Section 10.1, Han. The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts. Simplified characters are, as a rule, considered to have the same radical as their traditional forms and are found under the appropriate radical. For example, the character  is found under the same radical, , as its traditional form ().

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 849 15.1 Han Radical-Stroke Index Han Indices The Han Radical-Stroke Index is available as a separate file.

850 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Appendix A Han Unification History A

Efforts to create a unified Han character encoding are at least as venerable as the existing national standards. The Chinese Character Code for Information Interchange (CCCII), first developed in Taiwan in 1980, contains characters for use in China, Taiwan, and Japan. In somewhat modified form, it has been adopted for use in the United States as ANSI Z39.64-1989, also known as the East Asian Character Code (EACC) for bibliographic use. In 1981, Takahashi Tokutaro of Japan’s National Diet Library proposed standardization of a character set for common use among East Asian countries. The Unicode Han character set began with a project to create a Han character cross-refer- ence database at Xerox in 1986. In 1988, a parallel effort began at Apple based on the Research Libraries Group’s CJK Thesaurus, which is used to maintain EACC. The merger of the Apple and Xerox databases in 1989 led to the first draft of the Unicode Han character set. At the September 1989 meeting of X3L2 (an accredited standards committee for codes and character sets operating under the procedures of the American National Standards Institute), the Unicode Working Group proposed this set for inclusion in ISO 10646. The primary difference between the Unicode Han character repertoire and earlier efforts was that the Unicode Han character set extended the bibliographic sets to guarantee com- plete coverage of industry and newer national standards. The unification criteria employed in this original Unicode Han character repertoire were based on rules used by JIS and on a set of Han character identity principles (rentong yuanze) being developed in China by experts working with the Association for a Common Chinese Code (ACCC). An important principle was to preserve all character distinctions within existing and proposed national and industry standards. The Unicode Han proposal stimulated interest in a unified Han set for inclusion in ISO 10646, which led to an ad hoc meeting to discuss the issue of unification, held in Beijing in October 1989. The October 1989 meeting was the beginning of informal cooperation between the Unicode Working Group and the ACCC to exchange information on each group’s proposals for Han unification. A second ad hoc meeting on Han Unification was held in Seoul in February 1990. At this meeting, the Korean delegation proposed the establishment of a group composed of the East Asian countries and other interested organizations to study a unified Han encoding. From this informal meeting emerged the Chinese/Japanese/Korean Joint Research Group (hereafter referred to as the CJK-JRG). A second draft of the Unicode Han character repertoire was sent out for widespread review in December 1990 to coincide with the announcement of the formation of the Unicode Consortium. The December 1990 draft of the Unicode Han character set differed from the first draft in that it used the principle of KangXi radical-stroke ordering of the characters.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 961 Han Unification History

To verify independently the soundness and accuracy of the unification, the Consortium arranged to have this draft reviewed in detail by East Asian scholars at the University of Toron to. In the meantime, China announced that it was about to complete its own proposal for a Han Character Set, GB 13000. Concluding that the two drafts were similar in content and philosophy, the Unicode Consortium and the Center for Computer and Information Development Research, Ministry of Machinery and Electronic Industry (CCID, China’s computer standards body) agreed to merge the two efforts into a single proposal. Each added missing characters from the other set and agreed upon a method for ordering the characters using the four-dictionary ordering scheme described in Section 10.1, Han. Both proposals benefited greatly from programmatic comparisons of the two databases. As a result of the agreement to merge the Unicode Standard and ISO 10646, the Unicode Consortium agreed to adopt the unified Han character repertoire that was to be developed by the CJK-JRG. The first CJK-JRG meeting was held in Tokyo in July 1991. The group recognized that there was a compelling requirement for unification of the existing CJK ideographic characters into one coherent coding standard. Two basic decisions were made: to use GB 13000 (pre- viously merged with the Unicode Han repertoire) as the basis for what would be termed “The Unified Repertoire and Ordering,” and to verify the unification results based on rules that had been developed by Professor Miyazawa Akira and other members of the Japanese delegation. The formal review of GB 13000 began immediately. Subsequent meetings were held in Beijing and Hong Kong. On March 27, 1992, the CJK-JRG completed the Unified Repertoire and Ordering (URO), Version 2.0. This repertoire was subsequently published both by the Unicode Consortium in The Unicode Standard, Version 1.0, Volume 2 and by ISO in ISO/ IEC 10646-1:1993. In October 1993, the CJK-JRG became a formal subgroup of ISO/IEC JTC1/SC2/WG2 and was renamed the Ideographic Rapporteur Group (IRG). The IRG now has the formal responsibility of developing extensions to the URO 2.0 to expand the encoded repertoire of unified CJK ideographs. The Unicode Consortium participates in this group as a liaison member of ISO. In its second meeting in Hanoi in February 1994, the IRG agreed to include Vietnamese ChÔ Nôm ideographs in a future version of the URO and to add a fifth reference dictionary to the ordering scheme. In 1998, the IRG completed work on the first ideographic supplement to the URO, the CJK Unified Ideographs Extension A. This set of 6,582 characters is culled from national and industrial standards and historical literature and is encoded in The Unicode Standard, Ver- sion 3.0. The CJK Unified Ideographs Extension A represents the final set of CJK ideo- graphs that will be encoded without the use of surrogates. At the present time (summer 1999), the IRG is considering additional CJKV ideographs submitted by China, Taiwan, Hong Kong, Japan, Korea, Singapore, and Vietnam as further extensions to the ideographic character repertoire in this standard.

962 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Appendix C Relationship to ISO/IEC 10646 C

The Unicode Consortium maintains a strong working relationship with ISO/IEC/JTC1/ SC2/WG2, the working group developing International Standard 10646. Today both orga- nizations are firmly committed to maintaining the synchronization between the Unicode Standard and 10646. Each standard nevertheless uses its own form of reference and to some degree separate terminology. This appendix gives a brief history and explains how the stan- dards are related.

C.1 History Having recognized the benefits of developing a single universal character code standard, members of the Unicode Consortium worked with representatives from the International Organization for Standardization (ISO) during the summer and fall of 1991 to pursue this goal. Meetings between the two bodies resulted in mutually acceptable changes to both Unicode Version 1.0 and the first ISO/IEC Draft International Standard DIS 10646.1, which merged their combined repertoire into a single numerical character encoding. This work culminated in The Unicode Standard, Version 1.1. ISO/IEC 10646-1:1993, Information Technology—Universal Multiple-Octet1 Coded Charac- ter Set (UCS)—Part 1: Architecture and Basic Multilingual Plane was published in May 1993 after final editorial changes were made to accommodate the comments of voting members. The Unicode Standard, Version 1.1 reflected the additional characters introduced from the DIS 10646.1 repertoire and incorporated minor editorial changes. Merging The Unicode Standard, Version 1.0 and DIS 10646.1 consisted of aligning the numerical values of identical characters and then filling in some groups of characters that were present in DIS 10646.1, but not in the Unicode Standard. As a result, the character code values of ISO/IEC 10646-1:1993 UCS-2 and The Unicode Standard, Version 1.1 are precisely the same. Versions 2.0 and 3.0 of the Unicode Standard have added more charac- ters, matching recent additions to ISO/IEC 10646-1:2000. Table C-1 gives the timeline for these efforts.

1. Octet is ISO/IEC terminology for byte—that is, an ordered sequence of 8 bits considered as a unit.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 967 C.1 History Relationship to ISO/IEC 10646

Table C-1. Timeline

Year Version Summary 1989 DP 10646 Draft proposal, independent of Unicode 1990 Unicode Prepublication Prepublication review draft 1990 DIS-1 10646 First draft, independent of Unicode 1991 Unicode 1.0 Edition published by Addison-Wesley 1992 Unicode 1.0.1 Modified for merger compatibility 1992 DIS-2 10646 Second draft, merged with Unicode 1993 IS 10646-1:1993 Merged standard 1993 Unicode 1.1 Revised to match IS 10646-1:1993 1995 10646 amendments Korean realigned, plus additions 1996 Unicode 2.0 Synchronized with 10646 amendments 1998 Unicode 2.1 Added euro sign and corrigenda 1999 10646 amendments Additions 2000 Unicode 3.0 Synchronized with 10646 second edition 2000 (expected) IS 10646-1:2000 10646 part 1, second edition, publication with amendments to date

Unicode 1.0 The combined repertoire presented in ISO/IEC 10646 is a superset of The Unicode Stan- dard, Version 1.0 repertoire as amended by The Unicode Standard, Version 1.0.1. The Uni- code Standard, Version 1.0 was amended by the Unicode 1.0.1 Addendum to make the Unicode Standard a proper subset of ISO/IEC 10646. This effort entailed both moving and eliminating a small number of characters.

Unicode 2.0 The Unicode Standard, Version 2.0 covered the repertoire of The Unicode Standard, Version 1.1 (and IS 10646), plus the first seven amendments to IS 10646, as follows: Amd. 1: UTF-16 Amd. 2: UTF-8 Amd. 3: Coding of C1 Controls Amd. 4: Removal of Annex G: UTF-1 Amd. 5: Korean Hangul Character Collection Amd. 6: Tibetan Character Collection Amd. 7: 33 Additional Characters (Hebrew, , Dong) In addition, The Unicode Standard, Version 2.0 also covered Technical Corrigendum No. 1 (on renaming of AE  to ) and such Editorial Corrigenda to ISO/IEC 10646 as were applicable to the Unicode Standard. The euro sign and the object replacement char- acter were added in Version 2.1, per amendment 18 of ISO 10646-1.

Unicode 3.0 The Unicode Standard, Version 3.0 is synchronized with the second edition of ISO/IEC 10646-1. The latter contains all of the published amendments to 10646-1; the list includes the first seven amendments, plus the following: Amd. 8: Addition of Annex T: Procedure for the Unification and Arrangement of CJK Ideographs Amd. 9: Identifiers for Characters

968 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Relationship to ISO/IEC 10646 C.2 Encoding Forms in ISO/IEC 10646

Amd. 10: Ethiopic Character Collection Amd. 11: Unified Canadian Aboriginal Syllabics Character Collection Amd. 12: Cherokee Character Collection Amd. 13: CJK Unified Ideographs with Supplementary Sources (Horizontal Extension) Amd. 14: Yi Syllables and Yi Radicals Character Collection Amd. 15: Kangxi Radicals, Hangzhou Numerals Character Collection Amd. 16: Braille Patterns Character Collection Amd. 17: CJK Unified Ideographs Extension A (Vertical Extension) Amd. 18: Miscellaneous Letters and Symbols Character Collection (which includes the euro sign) Amd. 19: Runic Character Collection Amd. 20: Ogham Character Collection Amd. 21: Sinhala Character Collection Amd. 22: Keyboard Symbols Character Collection Amd. 23: Bopomofo Extensions and Other Character Collection Amd. 24: Thaana Character Collection Amd. 25: Khmer Character Collection Amd. 26: Myanmar Character Collection Amd. 27: Syriac Character Collection Amd. 28: Ideographic Description Characters Amd. 29: Mongolian Amd. 30: Additional Latin and Other Characters Amd. 31: Tibetan Extension The second edition of 10646-1 also contains the contents of Technical Corrigendum No. 2 and all the Editorial Corrigenda to date. The synchronization of The Unicode Standard, Version 3.0 with the second edition of ISO/ IEC 10646-1 means that the repertoire, encoding, and names of all characters are identical between the two standards, and that all other material from the amendments to 10646-1 that have a bearing on the text of the Unicode Standard have been taken into account in the revision of the Unicode Standard.

C.2 Encoding Forms in ISO/IEC 10646 ISO/IEC 10646 defines two alternative forms of encoding: • A four-octet (32-bit) encoding containing 231 code positions. These code posi- tions are conceptually divided into 128 groups of 256 planes, each plane con- taining 256 rows of 256 cells. • A two-octet (16-bit) encoding consisting of plane zero, the Basic Multilingual Plane (BMP). The 32-bit form is referred to as UCS-4 (Universal Character Set coded in 4 octets) and the 16-bit form is referred to as UCS-2 (Universal Character Set coded in 2 octets). The code numbers from 0 through 65,535 decimal (0–FFFF hexadecimal) can be repre- sented by character code values of 16 bits. The most useful characters (that is, the charac- ters found in major existing standards worldwide) are assigned in the BMP. ISO/IEC 10646

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 969 C.3 UCS Transformation Formats Relationship to ISO/IEC 10646 does not currently define any characters in other planes. The encoding of characters in other planes will be covered by ISO/IEC 10646-2. ISO/IEC 10646-1 does not currently encode any characters outside of the BMP. The char- acter repertoires and encoding assignments of the Unicode Standard and ISO/IEC 10646-1 are identical.

Zero Extending The character “A”, U+0041    , has the unchanging numerical value 41 hexadecimal. This value may be extended by any quantity of leading zeros to serve in the context of the following fixed-length encoding standards (see Table C-2). Table C-2. Zero Extending

Bits Standard Binary Hex Dec Char 7 ASCII 1000001 41 65 A 8 8859-1 01000001 41 65 A 16 Unicode, UCS-2 00000000 01000001 41 65 A 32 UCS-4 00000000 00000000 00000000 01000001 41 65 A

This design eliminates the problem of disparate code values in all systems that use any of the standards just mentioned.

C.3 UCS Transformation Formats

UTF-8 The term UTF-8 stands for UCS Transformation Format, 8-bit form. UTF-8 is an alterna- tive coded representation form for all of the characters of ISO/IEC 10646. The ISO/IEC definition is identical in format to UTF-8 as described in Section 2.3, Encoding Forms. UTF-8 can be used to transmit text data through communications systems that assume that individual octets in the range of x00 to x7F have a definition according to ISO/IEC 4873, including a C0 set of control functions according to the 8-bit structure of ISO/IEC 2022. UTF-8 also avoids the use of octet values in this range that have special significance during the parsing of file name character strings in widely used file-handling systems. UTF-8 was first defined in ISO/IEC 10646 Amendment 2 and will be included in the next edition of ISO/IEC 10646-1.

UTF-16 The term UTF-16 stands for UCS Transformation Format for 16 Planes of Group 00. UTF- 16 is the ISO/IEC encoding that is equivalent to the Unicode Standard with the use of sur- rogates as described in Chapter 3, Conformance. In UTF-16, each UCS-2 code value repre- sents itself. Non-BMP code values of ISO/IEC 10646 in planes 1..16 are represented using pairs of special codes. UTF-16 defines the transformation between the UCS-4 code posi- tions in planes 1 to 16 of Group 00 and the pairs of special codes, and is identical to the transformation defined in the Unicode Standard under D.28 in Section 3.7, Surrogates. Sample code for transforming UCS-4 into the Unicode Standard with surrogates is located on the CD-ROM.

970 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Relationship to ISO/IEC 10646 C.4 Synchronization of the Standards

In ISO/IEC 10646, high-surrogates are called RC-elements from the high-half zone and low- surrogates are called RC-elements from the low-half zone. Together, they constitute the S (Special) Zone of the BMP. UTF-16 represents the BMP and the next 16 planes. This system should not be an undue limitation because ISO JTC1/SC2/WG2 has no intention of assigning characters outside of planes 1..14 as it would break synchronization with the Unicode Standard. Planes 15 and 16 (000F0000..000FFFFF16 and 00100000..0010FFFF16) are reserved for private use. The UCS-4 private-use code values in groups 60 to 7F and in planes E0 to FF in group 00 are not accessible using UTF-16. Use of these private-use code values is strongly discouraged because data encoded with these code values will not be interchangeable with Unicode implementations. Planes 15 and 16 should be used instead. Applications interchanging ISO/IEC 10646 data containing non-BMP code values in planes 1..16 of ISO/IEC 10646 should use UTF-16 as the default encoding form in the absence of information to the contrary.

C.4 Synchronization of the Standards The goal of merging the Unicode Standard and DIS 10646.1 has been realized, making character code assignments identical in the Unicode Standard and ISO/IEC 10646. Pro- grammers and system users should treat the character code values from the Unicode Stan- dard and ISO/IEC 10646 as identities, especially in the transmission of raw character data across system boundaries. However, the Unicode Standard and ISO/IEC 10646 differ in the precise terms of their con- formance specifications. Any Unicode implementation will conform to ISO/IEC 10646, Level 3, but because Unicode Standard imposes additional constraints on character seman- tics and transmittability, not all implementations that are compliant with ISO/IEC 10646 will be compliant with the Unicode Standard.

C.5 Identification of Features for the Unicode Standard ISO/IEC 10646 provides mechanisms for specifying a number of implementation parame- ters, generating what may be termed instantiations of the standard. ISO/IEC 10646 con- tains no means of explicitly declaring the Unicode Standard as such. As a whole, however, the Unicode Standard may be considered as encompassing the entire repertoire of ISO/IEC 10646 and having the following features (as well as additional semantics): • Numbered subset 300 (BMP) • UTF-16 (if surrogates are used; UCS-2 otherwise) • Implementation level 3 (allowing both combining marks and precomposed characters) • Device type 1 (receiving device with full retransmission capability) Few applications are expected to make use of all of the characters defined in the ISO/IEC 10646 Basic Multilingual Plane. The conformance clauses of the two standards address this situation in very different ways. ISO/IEC 10646 provides a mechanism for specifying included subsets of the character repertoire, permitting implementations to ignore charac- ters that are not included (see Informative Annex A of ISO/IEC 10646). A Unicode imple- mentation requires a minimal level of handling all character codes—namely, the ability to store and retransmit them undamaged. Thus the Unicode Standard encompasses the entire

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 971 C.6 Character Names Relationship to ISO/IEC 10646

ISO/IEC 10646 Basic Multilingual Plane without requiring that any particular subset be implemented. The Unicode Standard does not provide formal mechanisms for identifying a stream of bytes as Unicode characters, although to some extent this function is served by use of the byte order mark (U+FEFF) to indicate byte ordering. ISO/IEC 10646 defines an ISO/IEC 2022 control sequence to introduce the use of 10646. ISO/IEC 10646 also allows the use of U+FEFF as a “signature” as described in ISO/IEC 10646. This optional “signature” conven- tion for discerning between forms UCS-2 and UCS-4 is brought to the attention of Uni- code implementers. This method is summarized in Section 13.6, Specials.

C.6 Character Names Unicode character names follow the ISO/IEC character naming guidelines (summarized in Informative Annex K of ISO/IEC 10646). In the prior version of the Unicode Standard, the naming convention followed the ISO/IEC naming convention,1 but with some differences that were largely editorial. For example, ISO/IEC 10646 name 029A       Unicode 1.0 name 029A      In the ISO/IEC framework, the unique character name is viewed as the major resource for both character semantics and cross-mapping among standards. In the framework of the Unicode Standard, character semantics are indicated via alias names, usage annotations, character properties, and functional specifications as mentioned in Chapter 3, Con- formance; cross-mappings among standards are provided in the form of explicit tables. The disparities between the Unicode Standard, Version 1.0 names and ISO/IEC 10646 names have been remedied by adoption of ISO/IEC 10646 names in the Unicode Standard. If the Unicode Standard, Version 1.0 name differed from the ISO/IEC 10646 name, then the pre- vious name is provided as an alias in the Unicode Character Database.

C.7 Character Functional Specifications The core of a character code standard is a mapping of code values to characters, but in some cases the semantics or even the identity of the character may be unclear. Certainly a character is not simply the representative glyph used to depict it in the standard. For this reason, the Unicode Standard supplies the necessary information to specify the semantics of the characters it encodes. Thus the Unicode Standard consists of far more than a chart of code values. It also contains a set of extensive character functional specifications and data, as well as substantial back- ground material designed to help implementers better understand how the characters interact. The Unicode Standard specifies properties and algorithms. Conformant imple- mentations of the Unicode Standard will also be conformant with ISO/IEC 10646, Level 3. Compliant implementations of ISO/IEC 10646 can be conformant to the Unicode Stan- dard—as long as the implementations conform to all additional specifications that apply to the characters of their adopted subsets, and as long as they support all Unicode characters outside their adopted subsets in the manner referred to in Section C.5, Identification of Fea- tures for the Unicode Standard.

1. The names adopted by the Unicode Standard are from the English-language version of ISO/IEC 10646, even when other language versions are published by ISO.

972 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Appendix D Changes from Unicode Version 2.0 D

D.1 Versions of the Unicode Standard The Unicode Technical Committee periodically updates the Unicode Standard to respond to industry needs and to maintain consistency with ISO/IEC 10646. There have been five formally numbered major and minor versions of the Unicode Standard. The relationship between these versions of Unicode and ISO/IEC 10646 is shown in Table D-1. For more detail on the relationship of Unicode and ISO/IEC 10646, see Appendix C, Relationship to ISO/IEC 10646. Table D-1. Versions of the Unicode Standard and ISO/IEC 10646-1

Year Version Published ISO/IEC 10646-1 1991 Unicode 1.0 Vol. 1, Addison-Wesley Basis of Draft-2 ISO 10646-1 1992 Unicode 1.0.1 Vol. 1, 2, Addison-Wesley Interim merger version 1993 Unicode 1.1 Technical Report #4 Matches ISO 10646-1 1996 Unicode 2.0 Addison-Wesley Matches ISO 10646-1 plus amendments 1998 Unicode 2.1 Technical Report #8 Matches ISO 10646-1 plus amendments 2000 Unicode 3.0 Addison-Wesley Matches ISO 10646-1 second edition

The Unicode Standard has grown from having 28,302 assigned character values in Version 1.0, to having 49,194 assigned character values in Version 3.0. Table D-2 documents the number of characters assigned in the different versions of the Unicode Standard. Table D-2. Number of Assigned Characters

V 1.0V 1.1V 2.0V 2.1V 3.0 Alphabetics, Symbols 4,748 6,309 6,509 6,511 10,236 Han (URO) 20,902 20,902 20,902 20,902 20,902 Han Extension A 6,582 Han Compatibility 302 302 302 302 302 Hangul Syllables 2,350 6,656 11,172 11,172 11,172 Total assigned characters 28,302 34,169 38,885 38,887 49,194 Private Use 5,632 6,400 6,400 6,400 6,400 Surrogates 2,048 2,048 2,048 Controls 65 65 65 65 65 Not Characters 22222

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 973 D.2 Changes from Unicode Version 2.0 to Version 2.1 Changes from Unicode Version 2.0

Table D-2. Number of Assigned Characters (Continued)

V 1.0V 1.1V 2.0V 2.1V 3.0 Total assigned 16-bit code values 34,001 40,636 47,400 47,402 57,709 Unassigned 16-bit code values 31,535 24,900 18,136 18,134 7,827

This appendix enumerates updates to conformance criteria and to character content and semantics made to the Unicode Standard, Version 2.1, and to Version 3.0. For further infor- mation on all major, minor, and update versions of the Unicode Standard, see the Unicode Web site at http://www.unicode.org/unicode/standard/versions/.

D.2 Changes from Unicode Version 2.0 to Version 2.1

New Characters Added Two new characters were added to the Unicode Standard, Version 2.1: U+20AC   U+FFFC   

Character Semantics Changes Significant clarifications or modifications to character semantics include the following: Apostrophe. Because the character U+0027  is very ambiguous, the preferred character for apostrophe was documented as either U+02BC   -  or U+2019    . Bidirectional Properties. Certain characters were given new bidirectional properties defini- tions. U+0026  and U+0040   were changed from left to right (L) to other neutral (ON). U+002E   and U+2007   were changed from European number separator (ES) to common number separator (CS). Identifier Syntax. Corrections were made to the implementation guidelines for identifier syntax, adding or removing a few characters from the recommended set for inclusion in identifiers. Mathematical and Letter Properties. A number of characters were given the mathematical property; see the list found in Unicode Technical Report #8, “The Unicode Standard, Ver- sion 2.1,” on the CD-ROM or the up-to-date version on the Unicode Web site for details. Two characters, U+02BC    and U+055A  - , were removed from the list of the letter property.

Changes Affecting Conformance Overall Unicode conformance criteria as described in Chapter 3 of Version 2.0 were unchanged. Aspects of the Unicode bidirectional algorithm were modified, Hangul syllable decompositions were clarified, and some normative character property values were changed. For details of these changes see Unicode Technical Report #8, “The Unicode Stan- dard, Version 2.1,” on the CD-ROM or the up-to-date version on the Unicode Web site.

974 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Changes from Unicode Version 2.0 D.3 Changes from Unicode Version 2.1 to Version 3.0

D.3 Changes from Unicode Version 2.1 to Version 3.0

New Characters Added In Version 3.0 of the Unicode Standard, 10,307 new characters have been added, as shown in Table D-3. Table D-3. New Characters Added

Allocation Count Character Name 01F6..01F9 4                     0218..021F 8                                                 0222..0233 18                                                                                                                           02A9..02AD 5                        02DF, 02EA..02EE 6                          

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 975 D.3 Changes from Unicode Version 2.1 to Version 3.0 Changes from Unicode Version 2.0

Table D-3. New Characters Added (Continued)

Allocation Count Character Name 0346..034E, 0362 10                                           03D7 1    03DB, 03DD, 03DF, 03E1 4                 0400, 040D, 0450, 045D 4                         0488..0489 2        048E..048F, 04EC..04ED 4                         058A 1   0653..0655, 06B8..06B9, 12    06BF, 06CF, 06FA..06FE                                                           0700..074A 71 Syriac 0780..07B1 50 Thaana 0D80..0DFF 80 Sinhala 0F6A, 0F96, 0FAE..0FCF 25 Tibetan Extensions 1000..1059 78 Myanmar 1200..137F 346 Ethiopic 13A0..13FF 85 Cherokee 1401..1676 630 Canadian Syllabics 1680..169F 29 Ogham 16A0..16FF 81 Runic 1780..17E9 103 Khmer 1800..18A9 155 Mongolian

976 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Changes from Unicode Version 2.0 D.3 Changes from Unicode Version 2.1 to Version 3.0

Table D-3. New Characters Added (Continued)

Allocation Count Character Name 202F, 2048..204D 7  -                    20AD..20AF 3       20E2..20E3 2       2139..213A 2      2183 1      21EB..21F3 9                                                   2301, 237B, 237D, 5   237E..237F              2380..238C 13                                         238D..239A 14     -- -  -- -  ---  ---       -                    2425..2426 2          

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 977 D.3 Changes from Unicode Version 2.1 to Version 3.0 Changes from Unicode Version 2.0

Table D-3. New Characters Added (Continued)

Allocation Count Character Name 25F0..25F7 8                                                 2619, 2670..2671 3            2800..28FF 256 Braille Pattern Symbols 2E80..2EF3 115 CJK Radicals Supplement 2F00..2FD5 214 KangXi radicals 2FF0..2FFB 12                                                                             3038..303A, 303E 4             31A0..31B7 24 Extended Bopomofo 3400..4DFF 6,582 CJK Unified Ideographs, Extension A A000..A48C 1,165 Yi A490..A4C1 50 Yi radicals FB1D 1      FFF9..FFFB 3         

Character Semantics Changes Significant clarifications or modifications to character semantics include the following: Bidirectional Properties. Bidirectional properties were made more consistent with the gen- eral category property, and new bidirectional properties were created. See Section 3.12, Bidirectional Behavior. Byte Order Mark. The use of the byte order mark with transformation formats was clari- fied. See Section 3.8, Transformations.

978 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Changes from Unicode Version 2.0 D.3 Changes from Unicode Version 2.1 to Version 3.0

Capital Letters with Iota Adscript. The representative glyphs, semantics, case mappings, and decompositions have been revised to make their handling more consistent. Case. Case properties have been extended for those situations where there is a mapping to multiple characters and where case is locale-dependent. See the SpecialCasing.txt file on the CD-ROM. Combining Classes. These classes were updated significantly to resolve problems of normal- ization and decomposition for Indic scripts in particular. See Table 4-3, Combining Classes. Decompositions. Unicode character decompositions have been significantly updated to fix errors in the original assignments, to allow correct collation weighting, and to make decompositions consistent for normalization. Eyelash Ra. Consonant RA rules have been updated and expanded. See rules R5 and R5a in Section 9.1, Devanagari. Figure Space. U+2007   is no longer treated like a numeric separator for pur- poses of bidirectional layout. See Section 6.1, General Punctuation, for a description of its tabular width characteristics. General Category. A series of General Category changes were made to assist the convergence of the Unicode definition of identifier with ISO TR 10176. Identifier Syntax. The list of recommended characters for computer language identifiers was corrected again, and the syntax for identifiers was further simplified. (See Section 5.16, Identifiers.) Layout Controls. The description of layout controls was enhanced to include the behavior of U+00A0 - , U+00AD  , and zero width spaces. See Section 13.2, Layout Controls. Line and Paragraph Separators. Use of line and paragraph separators is clarified in Section 3.12, Bidirectional Behavior, and in Section 13.1, Control Codes. Newlines. Line-handling characteristics have been documented more fully for Unicode environments. Discussions on CR and LF can be found in Section 3.12, Bidirectional Behav- ior; Section 5.9, Line Handling; Section 13.1, Control Codes; and in Unicode Technical Report #13, “Unicode Newline Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site. Quotation Marks. Two new punctuation categories, Pi and Pf, were created for initial and final quotes. The use of these categories and language-based usage of quotation marks is clarified in Section 6.1, General Punctuation. Script Capital p. The Weierstrass elliptic function symbol, U+2118   , actu- ally has the form of a lowercase calligraphic p despite the use of capital in its name. Symmetric Swapping. The symmetric swapping value for guillemets was corrected. Tibetan. Semantics of many Tibetan characters have been clarified or revised. See Section 9.13, Tibetan. Tilde. The use of U+007E  as a spacing clone of combining tilde and as a regular char- acter is described in Section 6.1, General Punctuation.

Changes Affecting Conformance Conformance clauses, definitions, and explanatory text were added for handling Unicode Transformation Formats. The Unicode bidirectional algorithm rules were clarified and expanded, and new bidirectional character properties were documented. Other normative

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 979 D.3 Changes from Unicode Version 2.1 to Version 3.0 Changes from Unicode Version 2.0 character property values were changed; see the Unicode Character Database found on the CD-ROM for details.

Unicode Technical Reports • UTR #11: East Asian Width, Version 5.0 • UTR #13: Unicode Newline Guidelines, Version 5.0 • UTR #14: Line Breaking Properties, Version 6.0 • UTR #15: Unicode Normalization Forms, Version 18.0

980 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. References R

Citations are given for the standards and dictionaries that were used as the actual resources for The Unicode Standard, primarily for Version 1.0. Where a Draft International Standard is known to have progressed to International Standard status, the entry for the DIS has been revised. A revised or reaffirmed edition, with a different date, may have been published subsequently. For the current version of a standard, see the catalog issued by the standards organization. The Web site of the International Organization for Standardization (http:// www.iso.ch/) includes the ISO Catalogue and links to the sites of member organizations. Many of the ISO character set standards were originally developed by ECMA and are also ECMA standards. In general, American library practice has been followed for the romanization of titles and names written in non-Roman script. Exceptions are when information supplied by an author had to be used because the name or title in the original script was unavailable.

R.1 Source Standards ANSI X3.4: American National Standards Institute. Coded character set—7-bit American national standard code for information interchange. , 1986. (ANSI X3.4-1986). ANSI X3.32: American National Standards Institute. American national standard graphic representation of the control characters of American national standard code for information interchange. New York, 1973. (ANSI X3.32-1973). ANSI Y10.20: American National Standards Institute. Mathematic signs and symbols for use in physical sciences and technology. New York, 1988. (ANSI Y10.20-1975 (R1988)). ANSI Z39.47: American National Standards Institute. Extended Latin alphabet coded char- acter set for bibliographic use. New York, 1985. (ANSI Z39.47-1985). ANSI Z39.64: American National Standards Institute. East Asian character code for biblio- graphic use. New Brunswick, Transaction, 1991. (ANSI Z39.64-1989). ASMO 449: Arab Organization for Standardization and Metrology. Data processing 7-bit coded character set for information interchange. [s.l.], 1983. (Arab standard specifications; 449-1982). Authorized English translation. CCCII: Zhongwen Zixun Jiaohuanma (Chinese Character Code for Information Inter- change). Revised edition. Taipei, Xingzhengyuan Wenhua Jianshe Xiaozu (Executive Yuan Committee for Cultural Construction), 1985. CNS 11643-1986: Tongyong hanzi biaozhun jiaohuanma (Han character standard inter- change code for general use). Taipei, Xingzhengyuan (Executive Yuan), 1986. CNS 11643-1992: Zhongwen biaozhun jiaohuanma (Chinese standard interchange code). Taipei, 1992. Obsoletes 1986 edition. EACC: (See ANSI Z39.64.) ECMA Registry: (See ISO Register.)

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 999 R.1 Source Standards References

GB 2312: Xinxi jiaohuanyong hanzi bianmaji, jibenji (Code of Chinese graphic character set for information interchange, primary set). Beijing, Jishu Biaozhun Chubanshe (Technical Standards Press), 1981. (GB 2312-1980). GB 12345: Xinxi jiaohuanyong hanzi bianmaji, fuzhuji (Code of Chinese set for information interchange, supplementary set). Beijing, Jishu Biaozhun Chubanshe (Technical Standards Press), 1990. (GB 12345-1990). GB 13000: Xinxi jishu—Tongyong duobawei bianma zifuji (UCS)—Diyi bufen: Tixi jiegou yu jiben duowenzhong pingmian (Information technology—Universal multiple-octet coded character set (UCS)—Part 1: Architecture and basic multilingual plane). Beijing, Jishu Biaozhun Chubanshe (Technical Standards Press), 1993. (GB 13000.1-93) (ISO/IEC 10646.1-1993). GB 13134: Xinxi jiaohuanyong yiwen bianma zifuji (Yi coded character set for information interchange), [prepared by] Sichuansheng Minzushiwu Weiyuanhui. Beijing, Jishu Biaozhun Chubanshe (Technical Standards Press), 1991. (GB 13134-1991). GBK: Xinxi jiaohuanyong hanzi bianma kuozhan guifan (Extended Code of Chinese graphic character set for information interchange). Beijing, Zhongguo dianzi gongyebu [and] Guojiao jishu jianduju, 1995. The Chinese-specific subset of GB 13000.1-93 GBK is implemented on Windows platforms as Code Page 936. ISCII: India Department of Electronics. Indian standard code for information interchange. New Delhi, 1988. ISO Register: International Organization for Standardization. ISO international register of character sets to be used with escape sequences. [], 1990. The current register is available at http://www.itscj.ipsj.or.jp/ISO-IR/. ISO/IEC 646: International Organization for Standardization. Information technology— ISO 7-bit coded character set for information interchange. [Geneva], 1991. (ISO/IEC 646:1991). ISO 2047: International Organization for Standardization. Information processing—Graph- ical representations for the control characters of the 7-bit coded character set. [Geneva], 1975. (ISO 2047:1975). ISO/IEC 2022: International Organization for Standardization. Information processing— ISO 7-bit and 8-bit coded character sets—Code extension techniques. 3d ed. [Geneva], 1986. (ISO 2022:1994). Edition 4 (ISO/IEC 2022:1994) has title: Information technology—Character code structure and extension techniques. ISO 2033: International Organization for Standardization. Information processing—Coding of machine-readable characters (MICR and OCR). 2d ed. [Geneva], 1983. (ISO 2033:1983). ISO/IEC 4873: International Organization for Standardization. Information technology— ISO 8-bit code for information interchange—Structure and rules for implementation. [Geneva], 1991. (ISO/IEC 4873:1991). ISO 5426: International Organization for Standardization. Extension of the Latin alphabet coded character set for bibliographic information interchange. 2d ed. [Geneva], 1983. (ISO 5426:1983). ISO 5426-2: International Organization for Standardization. Information and documenta- tion—Extension of the Latin alphabet coded character set for bibliographic information

1000 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard References R.1 Source Standards interchange—Part 2: Latin characters used in minor European languages and obsolete typog- raphy. [Geneva], 1996. (ISO 5426-2:1986). ISO 5427: International Organization for Standardization. Extension of the Cyrillic alphabet coded character set for bibliographic information interchange. [Geneva], 1984. (ISO 5427:1984). ISO 5428: International Organization for Standardization. Greek alphabet coded character set for bibliographic information interchange. [Geneva], 1984. (ISO 5428-1984). ISO/IEC 6429: International Organization for Standardization. Information technology— Control functions for coded character sets. 3d. ed. [Geneva], 1992. (ISO/IEC 6429:1992). ISO 6438: International Organization for Standardization. Documentation—African coded character set for bibliographic information interchange. [Geneva], 1983. (ISO 6438:1983). ISO 6862: International Organization for Standardization. Information and documen- tation—Mathematics character set for bibliographic information interchange. [Geneva], 1996. (ISO 6862:1996). ISO/IEC 6937: International Organization for Standardization. Information processing— Coded character sets for text communication. [Geneva], 1984. Edition 2 (ISO/IEC 6937:1994) has the following title: Information technology—Coded graphic set for text communication—Latin alphabet. ISO/IEC 8859: International Organization for Standardization. Information processing— 8-bit single-byte coded graphic character sets. [Geneva], 1987–. These parts of ISO/IEC 8859 predate the Unicode Standard and were used as resources: Part 1, Latin alphabet no. 1; Part 2, Latin alphabet no. 2; Part 3, Latin alphabet no. 3; Part 4, Latin alpha- no. 4; Part 5, Latin/Cyrillic alphabet; Part 6, Latin/Arabic alphabet; Part 7, Latin/Greek alpha- bet; Part 8, Latin/Hebrew alphabet; Part 9, Latin alphabet no. 5. ISO 8879: International Organization for Standardization. Information processing—Text and office systems—Standard generalized markup language (SGML). [Geneva], 1986. (ISO 8879:1986). ISO 8957: International Organization for Standardization. Information and documenta- tion—Hebrew alphabet coded character sets for bibliographic information interchange. [Geneva], 1996. (ISO 8957:1996). ISO 9036: International Organization for Standardization. Information processing—Arabic 7-bit coded character set for information interchange. [Geneva], 1987. (ISO 9036:1987). ISO/IEC 10367: International Organization for Standardization. Information technology— Standardized coded graphic character sets for use in 8-bit codes. [Geneva], 1991. (ISO/IEC 10367:1991). ISO 10585: International Organization for Standardization. Information and documenta- tion— coded character set for bibliographic information interchange. [Geneva], 1996. (ISO 10585:1996). ISO 10586: International Organization for Standardization. Information and docu- mentation—Georgian alphabet coded character set for bibliographic information interchange. [Geneva], 1996. (ISO 10586:1996). ISO/IEC 10646: International Organization for Standardization. Information Technology— Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and Basic Multi- lingual Plane. [Geneva], 1993. (ISO/IEC 10646-1:1993).

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 1001 R.2 Source Dictionaries for Han Unification References

ISO 10754: International Organization for Standardization. Information and documenta- tion—Extension of the Cyrillic alphabet coded character set for non-Slavic languages for bib- liographic information interchange. [Geneva], 1996 (ISO 10754:1996). ISO DIS 11548-1: International Organization for Standardization. Communication aids for blind persons—Identifiers, names and assignation to coded character sets for 8-dot Braille characters—Part 1: General guidelines for Braille identifiers and shift marks. [Geneva], 1997. (ISO DIS 11548-1). ISO DIS 11548-2: International Organization for Standardization. Communication aids for blind persons—Identifiers, names and assignation to coded character sets for 8-dot Braille characters—Part 2: Latin alphabet based character sets. [Geneva], 1997. (ISO DIS 11548-2). ISO 11822: International Organization for Standardization. Information and docu- mentation—Extension of the Arabic coded character set for bibliographic information inter- change. [Geneva], 1996. (ISO 11822:1996). JIS X 0208: Japanese Standards Association. Jouhou koukan you kanji fugoukei (Code of the Japanese Graphic Character Set for Information Interchange). Tokyo, 1990. (JIS X 0208- 1990). Revision of the 1983 edition. JIS X 0212: Japanese Standards Association. Jouhou koukan you kanji fugou-hojo kanji (Code of the supplementary Japanese graphic character set for information interchange). Tokyo, 1990. (JIS X 0212-1990). 1990 extensions to JIS. JIS X 0221: Information Technology—Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and Basic Multilingual Plane. Tokyo, 1995. (JIS X0221-1995). KS C 5601-1987: Korea Industrial Standards Association. Chongbo kyohwanyong puho (Han’gul mit Hancha). Seoul, 1989. (KS C 5601-1987). KS C 5601-1992: Korea Industrial Standards Association. Chongbo kyohwanyong puho (Han’gul mit Hancha). Seoul, 1992. (KS C 5601-1992). KS C 5657: Korea Industrial Standards Association. Chongbo kyohwanyong puho hwakchang set’u. Seoul, 1991. (KS C 5657-1991). SI 1311.1: Standards Institution of Israel. Information technology: ISO 8-bit coded character set with Hebrew points. [Tel Aviv, 1996] (SI 1311.1 (1996)). SI 1311.2: Standards Institution of Israel. Information technology: ISO 8-bit coded character set with Hebrew accents. [Tel Aviv, 1996] (SI 1311.2 (1996)). TIS 620-2529: Thai Industrial Standards Institute, Ministry of Industry. Thai Industrial Standard for Thai Character Code for Computer. Bangkok, 1986. (TIS 620-2529–1986). TIS 620-2533: Thai Industrial Standards Institute, Standard for Thai Character Codes for Computers. Bangkok, 1990. (TIS 620-253 –1990). ISBN 974-606-153-4. In Thai. Online version at http://www.nectec.or.th/it-standards/std620/std620.htm.

R.2 Source Dictionaries for Han Unification Dae Jaweon. Seoul, Samseong Publishing Co. Ltd., 1988. Dai Kan-Wa Jiten / Morohashi Tetsuji cho. Shu teiban. Tokyo, Taishukan Shoten, Showa 59-61 [1984–86].

1002 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard References R.3 Other Sources for the Unicode Standard

Hanyu Da Zidian. 1st edition. Chendu, Sichuan Cishu Publishing, 1986. KangXi Zidian. 7th edition. Beijing, Zhonghua Bookstore, 1989.

R.3 Other Sources for the Unicode Standard ALA-LC romanization tables: transliteration schemes for non-Roman scripts, approved by the Library of Congress and the American Library Association. Tables compiled and edited by Randall K. Barry. Washington, Library of Congress, 1997. ISBN 0-8444-0940-5. Bburx Ddie Su (= Bian Xiezhe). Nuo-su bbur-ma shep jie zzit: Syp-chuo nuo bbur-ma syt mu curx su niep sha zho ddop ma bbur-ma syt mu yuop hop, Bburx Ddie da Su. = Yi wen jian zi ben: Yi Han wen duizhao ban. Chengdu: Sìchuan minzu chubanshe, 1984. Bburx Ddie Su. Nip huo bbur-ma ssix jie: Nip huo bbur-ma ssi jie Bburx Ddie curx Su. = Yi Han zidian. Chengdu: Sìchuan minzu chubanshe, 1990. ISBN 7-5409-0128-4 Canadian Aboriginal Syllabic Encoding Committee. Repertoire of unified Canadian aborig- inal syllabics proposed for inclusion into ISO/IEC 10646: International standard universal multiple-octet coded character set. [Canada]: CASEC [1994]. ‘uro: one currency for Europe. http://europa.eu.int/euro/html/entry.html Journal of the International Phonetic Association, 24.2 (1994): 95–98, and 25.1 (1995): 21. Pullum, Geoffrey K. Phonetic Symbol Guide / Geoffrey K. Pullum and William A. Ladusaw. 2nd ed. Chicago, University of Chicago Press, 1996. ISBN 0-226-68535-7; 0-226-68536-5 (pbk.). Pullum, Geoffrey K. Remarks on the 1989 revision of the International Phonetic Alphabet. Journal of the International Phonetic Association, 20:1 (1990), 33–40. RFC 2152. UTF-7: A mail-safe transformation format of Unicode, by D. Goldsmith (Apple Computer, Inc.) and M. Davis (Taligent, Inc.). May 1997. http://www.rfc-editor.org (and search for the RFC)or ftp://ftp.isi.edu/in-notes/rfc2152.txt RFC 2279. UTF-8: A transformation format of ISO 10646, by F. Yergeau (Alis Technologies). January 1998. http://www.rfc-editor.org (and search for the RFC) or ftp://ftp.isi.edu/in-notes/rfc2279.txt Samaranayake V. K. A Standard Code for Information Interchange in Sinhalese, by V. K. Samaranayake and S. T. Nandasara. 1990. (ISO/IEC JTC1/SC2/WG2 N673).

R.4 Selected Resources

Alexander, J. T. A Dictionary of the Cherokee Indian Language. [Sperry, Oklahoma?] Pub- lished by the author, 1971. Allworth, Edward. Nationalities of the Soviet East: Publications and Writing Systems. New York, Columbia University Press, 1971. ISBN 0-231-03274-9. Alphabete und Schriftzeichen des Morgen- und Abendlandes. 2. übearb, u. erw. Aufl. Berlin, Bundesdruckerei, 1969.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 1003 R.4 Selected Resources References

The Alphabet Makers: A Presentation from the Museum of the Alphabet. Waxhaw, North Carolina. 2nd ed. Huntington Beach, California, Summer Institute of Linguistics, 1991. ISBN 0-938978-13-6. Armbruster, Carl Hubert. Initia Amharica: An Introduction to Spoken Amharic. Cambridge, Cambridge University Press, 1908–1920. Bergsträsser, Gotthelf. Introduction to the Semitic Languages: Text Specimens and Grammat- ical Sketches. Translated ... with an appendix on the scripts by Peter T. Daniels. [2d. ed.] Winona Lake, Eisenbrauns, 1995. ISBN 0-931464-10-2. Translation of Einführung in die semitschen Sprachen, 1928. A Bibliography on Writing and Written Language, edited by Konrad Ehlich, Florian Coul- mas, Gabriele Graefen. Berlin, New York, Mouton Gruyter, 1996. (Trends in linguistics. Studies and monographs, 89.) ISBN 3-11-010158-0. Contains references to about 27,500 publications covering mainly 1930–1992. The Book of a Thousand Tongues, by Eugene A. Nida. Rev. ed. London, United Bible Society, 1972. First ed. by Eric M. North, 1938. Bringhurst, Robert. The Elements of Typographic Style. 2d ed. Point Roberts, Washington, Hartley & Marks, 1996. ISBN 0-88179-133-4; 0-88179-132-6 (pbk.). Campbell, George L. Compendium of the World’s Languages. London, Routledge, 1990. ISBN 0-415-06937-6 (set); 0-415-06978-5 (v.1); 0-415-06979-3 (v.2). Clews, John. Language Automation Worldwide: The Development of Character Set Standards. Harrogate (North Yorkshire), SESAME Computer Projects, 1988. ISBN 1-87009-501-4. Comrie, Bernard, ed. The World’s Major Languages. Oxford, Oxford University Press, 1987. ISBN 0-19-520521-9; 0-19-506511-5 (pbk.). Comrie, Bernard, ed. The Languages of the Soviet Union. Cambridge, Cambridge University Press, 1981. ISBN 0-521-23230-9; 0-521-9877-6 (pbk.). Coulmas, Florian. The Blackwell Encyclopedia of Writing Systems. Cambridge, Massachu- setts, Blackwell, 1996. ISBN 0-631-19446-0. Coulmas, Florian. The Writing Systems of the World. Oxford, New York, Blackwell, 1989. Daniels, Peter T. The World’s Writing Systems. (See World’s Writing Systems.) DeFrancis, John. : The Diverse Oneness of Writing Systems. Honolulu, Univer- sity of Hawaii Press, 1989. ISBN 0-8248-1207-7. Diringer, David. The Alphabet: A Key to the History of Mankind. 3d ed., completely rev. with the assistance of Reinhold Regensburger. New York, Funk and Wagnalls, 1968. Also published: London, Hutchinson. ISBN 0-906764-0-8. Diringer, David. Writing. London, Thames and Hudson, 1962. Also published: New York, Praeger. Dixon, Robert M. W. The Languages of Australia. Cambridge, Cambridge University Press, 1980. ISBN 0-521-22329-6. Drucker, Johanna. The Alphabetic Labyrinth: The Letters in History and Imagination. Lon- don, Thames & Hudson, 1999. ISBN 0-500-28068-1. Endo, Shotoku. Hayawakari Chugoku Kantai-ji. Tokyo, Kokusho Kankokai, Showa 61 [1986].

1004 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard References R.4 Selected Resources

Esling, John. Computer Coding of the IPA: Supplementary Report. Journal of the Interna- tional Phonetic Association, 20:1 (1990), 22–26. Everson, Michael. Michael Everson’s Cool Bibliography of Typography and Scripts. http://www.egt.ie/standards/iso15924/document/scriptbib.html Extensible Markup Language (XML), 1.0. (W3C Recommendation 10 February 1998). (REC--19980210). Editors: Tim Bray, Jean Paoli, [and] C. M. Sperberg-McQueen. http://www.w3.org/TR/1998/REC-xml-19980210 Faulmann, Carl. Schriftzeichen und Alphabete aller Zeiten und Völker. Reprint of Das Buch Der Schrift ..., 1880. Augsburg, Augustus Verlag, 1990. ISBN 3-8043-0142-8. Flanagan, David. Java in a Nutshell. 2d. ed. Sebastopol, California, O’Reilly, 1997. ISBN 1- 565-92262-X. Friedrich, Johannes. Geschichte der Schrift. Heidelberg, C. Winter, 1966. Friesen, Otto von. Runorna. Stockholm, A. Bonnier [1933]. (Nordisk kultur, 6). Garneau, Denis. Keys to Sort and Search for Culturally-Expected Results. [s.l.] IBM, 1990. (IBM document number GG24-3516, June 1, 1990). Gaur, Albertine. A . Rev. ed. New York, Cross River Press, 1992. ISBN 1- 558-59358-6. Also published: London, . ISBN 0-7123-0270-0. Geiger, Wilhelm. Maldivian Linguistic Studies. New Delhi, Asian Educational Services, 1996. ISBN 81-206-1201-9. Originally published: Colombo, H. C. Cottle, Govt. Printer, 1919. Gelb, Ignace J. A Study of Writing. Rev. ed. Chicago, University of Chicago Press, 1963. ISBN 0-226-28605-3; 0-226-28606-1 (pbk.). Giliarevskii, Rudzhero Sergeevich. Languages Identification Guide, by Rudzhero S. Gil- yarevsky and Vladimir S. Grivnin. Moscow, Nauka, 1970. The Gospel in Many Tongues: Specimens of 875 Languages ... London, British and Foreign Bible Society, 1965. Gunasekara, Abraham Mendis. A Comprehensive Grammar of the Sinhalese Language. New Delhi, Asian Education Services, 1986. Reprint of 1891 edition. Haarmann, Harald. Universalgeschichte der Schrift. Frankfurt, Campus Verlag, 1990. ISBN 3-593-34346-0. Habein, Yaeko Sato. The History of the Japanese Written Language. Tokyo, University of Tokyo Press, 1984. ISBN 0-86008-347-0; 4-13-087047-5. Haugen, Einar Ingvald. The Scandinavian Languages: an Introduction to their History. Lon- don, Faber, 1976. ISBN 0-571-10423-1. Also published: Cambridge, Massachusetts, Harvard University Press, 1976. ISBN 0-674-79002-2. Healey, John F. The Early Alphabet. Berkeley, University of California Press; [London] Brit- ish Museum, 1990. (Reading the Past, 9). ISBN 0-7309-6. Publications edition has ISBN 0-7141-8073-4. Holmes, Ruth Bradley. Beginning Cherokee, by Ruth Bradley Holmes and Betty Sharp Smith. 2d ed. Norman, University of Oklahoma Press, 1977. ISBN 0-8061-1464-9.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 1005 R.4 Selected Resources References

Huang, K’o-tung. An Introduction to Chinese, Japanese and Korean Computing, by Jack K. T. Huang and Timothy D. Huang. Singapore, Teaneck, New Jersey, World Scientific, 1989. ISBN 9971-50-664-5. IBM. National Language Support Reference Manual. 4th ed. North York, Ontario, IBM Can- ada Ltd., National Language Technical Center, 1994. (National Language Design Guide, 2) (SE09-8002-03). Ifrah, Georges. From One to Zero, a Universal History of Numbers. New York, Penguin, 1987. ISBN 0-14-009919-0. Translation of Histoire universelle des chiffres. Also published: New York, Viking, 1985. ISBN 0- 670-37395-8. International Organization for Standardization. Information Technology—An Operational Model for Characters and Glyphs. Geneva, 1998. (ISO/IEC TR 15285:1998). Isaev, Magomet Izmailovich. Sto tridtsat’ ravnopravnykh; o iazykakh narodov SSSR. Moskva, Nauka, 1970. Jensen, Hans. Sign, Symbol and Script: An Account of Man’s Efforts to Write. New York, Put- nam, 1969. Translation of Die Schrift in Vergangenheit und Gegenwart (Berlin, Deutscher Verlag, 1969). Also published: London, Allen & Unwin. ISBN 0-04-400021-9. Katzner, Kenneth. The Languages of the World. New ed. London, Routledge, 1995. ISBN 0- 415-11809-3. Kefarnissy, Paul. Grammaire de la langue Araméenne syriaque. Beyrouth, 1962.

Knuth, Donald E. The ΤΕΧbook. 21st. printing, rev. Reading, Massachusetts, Addison-Wes- ley, 1994. ISBN 0-201-134489. Ksar, Michael Y. “Untying Tongues,” ISO Bulletin (June 1993), 2–8. Launhardt, Johannes. Guide to Learning the Oromo (Galla) Language. Addis Ababa, Laun- hardt [1973?]. Leslau, Wolf. Amharic Textbook. Weisbaden, Harrassowitz; Berkeley, University of Califor- nia Press, 1968. Lunde, Ken. CJKV Information Processing. Beijing [etc.], O’Reilly, 1999. ISBN 1-565-92224-7. Malherbe, Michel. Les Langages de l’Humanité: Une Encyclopédie des 3000 Langues Parlées dans le Monde. Paris, Laffont, 1995. ISBN 2-221-05947-6. Maniku, Hassan Ahmed. Say It in Maldivian (Dhivehi), [by] H. A. Maniku [and] J. B. Dis- anayaka. Colombo, Lake House Investments, 1990. McManus, Damian. A Guide to Ogam. Maynooth, An Sagart, 1991. (Maynooth mono- graphs, 4) ISBN 1-87068-417-6. Muller, Siegfried H. The World’s Living Languages: Basic Facts of Their Structure, Kinship, Location, and Number of Speakers. New York, Ungar, 1964. Musaev, Kenesbai Musaevich. Alfavity iazykov narodov SSSR. Moskva, Nauka, 1965. Musset, Lucien. Introduction à la Runologie. Paris, Aubier-Montaigne, 1965. Naik, Bapurao S. Typography of Devanagari. 1st ed., rev. Bombay, Directorate of Languages, Govt. of Maharashtra, 1971.

1006 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard References R.4 Selected Resources

Nakanishi, Akira. Writing Systems of the World: Alphabets, Syllabaries, . Rutland, Vermont, Tuttle, 1980. ISBN 0-8048-1293-4; 0-8048-1654-9 (pbk.). Revised translation of Sekai no moji. New Echota Letters: Contributions of Samuel A. Worcester to the Cherokee Phoenix, edited by Jack Frederick Kilpatrick and Anna Gritts Kilpatrick. Dallas, Southern Methodist Univer- sity Press, [s.d.]. Includes reprint of an article by S. A. Worcester, which appeared in the Cherokee Phoenix, Feb. 21, 1828. Nida, Eugene A. The Book of a Thousand Tongues. (See Book of a Thousand Tongues.) Nöldeke, Theodor. Compendious Syriac Grammar. With a table of characters by Julius Eut- ing. Translated … from the 2d and improved German ed., by James A. Crichton. London, Williams & Norgate, 1904. Reprinted: Tel Aviv, Zion Pub. Co. [1970]. Nonroman Scripts in the Bibliographic Environment. In Encyclopedia of Library and Infor- mation Science, 56 (supplement 19), 260–283. New York, Dekker, 1995. O’Donnell, Sandra Martin. Programming for the World, a Guide to Internationalization. Englewood Cliffs, New Jersey, PTR Prentice-Hall, 1994. ISBN 0-13-722190-8. Page, Raymond Ian. Runes. Berkeley, University of California Press; [London] British Museum, 1987. (Reading the Past). ISBN 0-520-06114-4. British Museum Publications edition has ISBN 0-7141-8065-3. Pavlenko, Nikolai Andreevich. Istoria Pis’ma. 2. izd. Minsk, Vysshaia Shkola, 1987. Ramsey, S. Robert. The . 2d printing with revisions. Princeton, Prince- ton University Press, 1989. ISBN 0-691-01468-X. Robinson, Andrew. The Story of Writing. London, Thames and Hudson, 1995. ISBN 0-500- 01665-8. Robinson, Theodore Henry. Paradigms and Exercises in Syriac Grammar. 4th ed. rev. by L. H. Brockington. Oxford, Clarendon Press; New York, Oxford University Press, 1962. ISBN 0-19-815416-X, 0-19-815458-5 (pbk.). Roop, D. Haigh. An Introduction to the Burmese Writing System. [Honolulu], Center for Southeast Asian Studies, University of Hawaii at Manoa, 1997. (Southeast Asia paper, 11). Originally published: New Haven, Yale University Press, 1972. (Yale linguistic series). ISBN 0-300- 01528-3. Ruhlen, Merritt. A Guide to the World’s Languages, Volume 1: Classification, with a Postscript on Recent Developments. Stanford, California, Stanford University Press, 1991. ISBN 0- 8047-1894-6 (v. 1). Also published: London, Arnold. ISBN 0-340-56186-6 (v.1). Sampson, Geoffrey. Writing Systems: A Linguistic Introduction. Stanford, California, Stan- ford University Press, 1985. ISBN 0-8047-1254-9. Also published: London, Hutchinson. ISBN 0-09-156980-X; 0-09-173051-1 (pbk.). Senner, Wayne M. The Origins of Writing. Lincoln, University of Nebraska Press, 1989. ISBN 0-8032-4202-6; 0-8032-9167-1 (pbk.). Shepherd, Walter. Shepherd’s Glossary of Graphic Signs And Symbols. Compiled and classi- fied for ready reference by Walter Shepherd. New York, Dover, 1971. ISBN 0-486-20700-5. Also published: London, Dent. ISBN 0-460-03818-4.

The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 1007 R.4 Selected Resources References

Shinmura, Izuru. Kojien / Shinmura Izuru hen. Dai 4-han. Tokyo, Iwanami Shoten, 1991. Stevens, John. Sacred of the East. 3d ed., rev. and expanded. Boston, Shambala, 1995. ISBN 1-570-62122-5. Suarez, Jorge A. The Mesoamerican Indian Languages. Cambridge, Cambridge University Press, 1983. ISBN 0-521-22834-4; 0-521-29669-2 (pbk.). Universal Multiple-Octet Coded Character Set Coexistence and Migration. High Wycombe, X/Open Company, 1994. (X/Open Technical Study, E401.) ISBN 1-85912-031-8. von Ostermann, Georg F. Manual of Foreign Languages. 4th ed., revised and enlarged New York, Central Book Company, 1952. Wemyss, Stanley. The Languages of the World, Ancient and Modern: The Alphabets, Ideo- graphs, and Other Written Characters of the World in Sound and Symbol. Philadelphia, 1950. The World’s Writing Systems. Edited by Peter T. Daniels and William Bright. New York, Oxford University Press, 1995. ISBN 0-19-507993-0. W3C Recommendation. See Extensible Markup Language (XML).

1008 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. I.2 General Index The general index is a guide to the written text only. To find the range for a particular script, see the Contents. To find the code for a specific character, see the preceding Unicode Names Index. For the definition of specific terms, see the Glossary.

A Character pairs Allocation ...... 21–23 see Mirrored property Alphabetic property ...... 102 Character sequences ...... 38 Character shaping selectors ...... 320 Annotation characters ...... 325 Apostrophes ...... 153 Characters ...... 40–42 Arabic ...... 189–198 Concept ...... 13 Duplicate encoding ...... 17 Dual-joining characters (table) ...... 195 Joining classes ...... 192 Editing ...... 116–117 Ligatures ...... 194 Equivalence ...... 136 Interpretation ...... 5, 30–32, 38 Other character joining classes (table) ...... 197 Right-joining characters (table) ...... 196 Missing and unknown ...... 108 Architecture ...... 9–11 Modification ...... 39 Names ...... 101, 972 Armenian ...... 172 Arrows ...... 301 New in 3.0 ...... 3, 975 Number of assigned ...... 973 Properties ...... 42 B Reassigned ...... 109 Bengali ...... 224 Reserved ...... 334 Bidirectional algorithm ...... 55–69 Transcoding ...... 105–107 Bidirectional character types ...... 59 Cherokee ...... 287 Bidirectional ordering codes ...... 318 CJK compatibility ...... 307 Bidirectional placement ...... 121 CJK compatibility forms ...... 156 Block elements ...... 304 CJK enclosed letters and months ...... 307 BNF CJK punctuation ...... 155 see Extended BNF CJK standards ...... 258 Bopomofo ...... 278 Code charts ...... 331–335 Boundaries ...... 116–117, 124–133 Glyphs ...... 332 Grapheme ...... 126–127 Code space assignment ...... 23 Line ...... 129–132 Code value ...... 41 Sentence ...... 132–133 Collation ...... 114, 139 Word ...... 127–128 Combining characters ...... 24–27, 134 Box drawing ...... 304 Canonical ordering ...... 50–52, 75 Braille ...... 308 Definition ...... 43 Burmese Input ...... 118 see Myanmar Rendering ...... 118, 120–123 Byte order mark ...... 28, 324–325 Truncating ...... 119–120 Byte ordering ...... 37 Combining classes ...... 51, 75–85 Combining half marks ...... 181 C Combining marks ...... 179 Canadian Aboriginal Syllabics ...... 288 Compatibility characters ...... 17 Canonical decomposition ...... 44 Compatibility decomposition ...... 44 Canonical ordering behavior ...... 50–52 Compatibility ideographs ...... 267 Cantillation marks ...... 187 Compatibility jamo ...... 275 Caron ...... 162 Compression ...... 113 Case ...... 75 Conformance ...... 30–32, 37–69 Case mappings ...... 75, 141–143 Bidirectional ...... 67 Case pairs ...... 164 Examples ...... 30–32 Cedilla ...... 162 Requirements ...... 37–39 Changes from version 2.0 ...... 973–980 Conjoining jamo ...... 52–55 Character classes ...... xxviii Contextual forms Character codes ...... 12 Positioning ...... 123 Character encoding schemes ...... 21 Control characters ...... 29 Character names list ...... 331 Control codes ...... 314 Control pictures ...... 302

The Unicode Standard 3.0 1037 I.2 General Index Indices

Control sequences ...... 30 H Convertibility ...... 18 Halfwidth forms ...... 273 Currency symbols ...... 297 Han characters ...... 261 Cursive connection ...... 316 Han ideograph arrangement ...... 266 Cyrillic ...... 171 Han radical-stroke index ...... 849 Han unification ...... 961–962 D Language tags ...... 115 Dashes ...... 150 Principles ...... 262–266 Dead keys ...... 118 Sources ...... 259 Decimal digits ...... 89, 110 Handwriting sequence ...... 118 Decomposition ...... 44 Hangul ...... 275–277 Default properties ...... 74 Hangul syllables ...... 53–55, 276 Deprecated format characters ...... 320–321 Collation ...... 275 Design ...... 3–4 Hebrew ...... 186–188 Design principles ...... 12–18 Hebrew alphabetic presentation forms ...... 188 Devanagari ...... 211–223 Hiragana ...... 272 Diacritical marks ...... 27 Hyphenation ...... 10 Diacritics ...... 25, 179–180 Hyphens ...... 150 Combining ...... 179 Double ...... 50, 179 I For symbols ...... 180 IANA charset names ...... 21 Latin script ...... 160 Identifier syntax ...... 133–134 On i and j ...... 163 Ideographic accounting numbers ...... 97 Polytonic Greek ...... 169 Ideographic Description Sequences (IDS) ...... 269 Spacing clones ...... 161 Ideographic numbers ...... 97, 111 Digit names ...... 191 Ideographic property ...... 102 Dingbats ...... 305 Ideographic Variation Mark ...... 270 Directional formatting ...... 56 Ideographs ...... 258–266, 335 Directionality ...... 24 Component structure ...... 264 Directionality (property) ...... 42, 85–86 General characteristics ...... 260 Dynamic composition ...... 18 Radicals ...... 267 Sorting ...... 262 E Unknown or unavailable ...... 155 East Asian scripts ...... 257–280 see also Han unification Enclosed alphanumerics ...... 307 Implementation guidelines ...... 105–143 Enclosing marks ...... 50 Informative properties ...... 73–74 Encoding ...... 11 International Phonetic Alphabet Encoding forms ...... 19–21 see IPA extensions In ISO/IEC 10646 ...... 969 Invalid code values ...... 38 Equivalence ...... 136 IPA extensions ...... 164 Equivalent sequence ...... 18 ISO/IEC 10646 ...... 5, 967–972 Ethiopic ...... 284–286 Euro sign ...... 297 J European alphabetic scripts ...... 159–181 Jamo ...... 52–55, 275 Extended BNF ...... xxviii Short name (property) ...... 86–87 Joiner ...... 190 F Jungseong ...... 276 Fancy text ...... 15 Justification ...... 122 Format characters ...... 313–327 Format control characters ...... 29, 134 K Fullwidth forms ...... 273 Kanbun ...... 267 Kannada ...... 234 G Katakana ...... 273 General category ...... 87–88 Kerning ...... 123 Geometric shapes ...... 304 Khmer ...... 251–253 Geometrical symbols ...... 304 Georgian ...... 173 L Glyphs ...... 162, 332 Language information ...... 114–116 Grapheme ...... 10 Language tagging ...... 114 Greek ...... 167–170 Lao ...... 239 Gujarati ...... 226 Latin script ...... 160–166 Gurmukhi ...... 225 Layout control characters ...... 29, 134, 315–319 Letter (informative property) ...... 102

1038 The Unicode Standard 3.0 Indices I.2 General Index

Letter spacing ...... 316 Proposals for new scripts and characters . . . 963–965 Letterlike symbols ...... 298 Punctuation ...... 147–156 Liaison members ...... 6 Ligatures ...... 27, 318 Q Arabic ...... 194 Quotation marks ...... 151 Devanagari ...... 221 Latin ...... 166 Positioning ...... 122 R Syriac ...... 205 Radicals ...... 267 Line breaking ...... 315 Yi ...... 280 Line handling ...... 113 Random access ...... 133 Line separator ...... 29 Reassigned characters ...... 109 Logical order ...... 16 Regular expressions ...... 114 Rendering ...... 108, 118 M Replacement characters ...... 29, 326 Reserved characters ...... 334 Malayalam ...... 235 Reserved codes ...... 23 Mandarin tone marks ...... 278 Rich text ...... 15 Mapping tables ...... 107 Runic ...... 174–175 Mathematical operators ...... 300 Mathematical property ...... 101 Matras ...... 212 S Middle Eastern scripts ...... 185–206 Searching ...... 140 Minnan and Hakka tone marks ...... 279 Semantics ...... 15, 40 Mirrored property ...... 42, 97–101 Separators ...... 29 Miscellaneous symbols ...... 305 Shift-JIS index ...... 923 Modifier letters ...... 177 Signature for Unicode data ...... 28, 325 Mongolian ...... 289–291 Sinhala ...... 236 Multiple base characters ...... 27 Small form variants ...... 156 Myanmar ...... 249–250 Sorting ...... 135–140 South and Southeast Asian scripts ...... 209–253 N Space characters ...... 149 Special areas ...... 313–327 Names ...... 86–87, 101, 972 Special character properties ...... 42, 47–50 New character proposals ...... 963–965 Subscripts ...... 299 Newlines ...... 113 Subsets ...... 31–32 Noncharacter values ...... 28 Superscripts ...... 299 Noncharacters ...... 327 Surrogates ...... 21, 45, 322 Nongraphic characters ...... 23 Handling surrogate pairs ...... 109–110 Non-joiner ...... 190 Symbols ...... 295–309 Nonspacing marks ...... 117–120, 120–123 Symmetric swapping ...... 320 Positioning ...... 122 Syriac ...... 199–205 see Combining characters Cursive joining ...... 203 Normalization ...... 112–113 Ligatures ...... 205 Normative properties ...... 40, 73–74 Shaping ...... 203 Notational conventions ...... xxviii–xxx Syriac abbreviation mark ...... 200 Number forms ...... 299 Numbers ...... 110 Ideographic ...... 111 T Naming ...... 190 Table optimization ...... 106 Numeric shape selectors ...... 321 Tables Numeric value ...... 89–96 Multistage ...... 106 Two-stage ...... 106 O Tamil ...... 228–232 Technical symbols ...... 302 Ogham ...... 176 Telugu ...... 233 Optical character recognition ...... 303 Text elements ...... 5, 10, 116 Oriya ...... 227 Text handling ...... 4–5 Text processes ...... 9–11 P Thaana ...... 206 Paired punctuation ...... 149 Thai ...... 237–238 Paragraph separator ...... 29 Tibetan ...... 240–248 Plain text ...... 15 Tilde ...... 149 Private use area ...... 323 Tone letters ...... 178 Properties ...... 42, 73–102, 111–112 Transcoding ...... 105–107 see also individual properties Transformations ...... 39, 45–47

The Unicode Standard 3.0 1039 I.2 General Index Indices

Transmission 7-bit and 8-bit ...... 107 Truncation ...... 119 U UCS see ISO/IEC 10646 Unassigned characters ...... 108 Unassigned codes ...... 23 Unicode 1.0 names ...... 101 Unicode Character Database ...... xxvii Unicode Consortium ...... 6 Addresses ...... xxx ISO membership ...... 6 Liaison members ...... 6 Membership ...... 6 Unicode Consortium Resources ...... xxx Anonymous FTP site ...... xxx Public mailing list ...... xxx Web site ...... xxx Unicode Standard About the Unicode Standard ...... xxv–xxviii Changes from version 2.0 ...... 973–980 Coverage ...... 2–3 Design ...... 3–4 Design principles ...... 12–18 Future expansion ...... 2 New characters ...... 3, 975 Synchronization with ISO/IEC 10646 ...... 971 Technical reports ...... xxvii Versions ...... 32, 973–974 Unicode Technical Committee (UTC) ...... 6 Unification ...... 17 see also Han unification Universal Character Set see ISO/IEC 10646 Unknown characters ...... 108 Unrenderable characters ...... 108 UTF-16 ...... 19, 970 UTF-8 ...... 20, 970 V Versions of the Unicode Standard . . . . . 32, 973–974 Vietnamese letters and tone marks ...... 166 Virama ...... 213 Visual ambiguity ...... 17 W Wide character type ...... 107–108 Word breaking ...... 315 Writing direction ...... 24 Y Yi ...... 280 Yiddish digraphs ...... 187

1040 The Unicode Standard 3.0 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:

Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000.