Programming with the Text Encoding Conversion Manager

Programming With the Text Encoding Conversion Manager May 14, 2003 Druckmaschinen AG, available from Apple Computer, Inc. Linotype Library GmbH. © 2004 Apple Computer, Inc. Java and all Java-based trademarks are All rights reserved. trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other No part of this publication may be countries. reproduced, stored in a retrieval system, or transmitted, in any form or by any means, PowerPC and and the PowerPC logo are mechanical, electronic, photocopying, trademarks of International Business recording, or otherwise, without prior Machines Corporation, used under license written permission of Apple Computer, Inc., therefrom. with the following exceptions: Any person Xerox is a trademark of Xerox Corporation. is hereby authorized to store documentation Simultaneously published in the United on a single computer for personal use only States and Canada. and to print copies of documentation for personal use provided that the Even though Apple has reviewed this manual, APPLE MAKES NO WARRANTY OR documentation contains Apple’s copyright REPRESENTATION, EITHER EXPRESS OR notice. IMPLIED, WITH RESPECT TO THIS MANUAL, ITS QUALITY, ACCURACY, The Apple logo is a trademark of Apple MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. AS A RESULT, THIS Computer, Inc. MANUAL IS SOLD ªAS IS,º AND YOU, THE PURCHASER, ARE ASSUMING THE ENTIRE Use of the “keyboard” Apple logo RISK AS TO ITS QUALITY AND ACCURACY. (Option-Shift-K) for commercial purposes IN NO EVENT WILL APPLE BE LIABLE FOR without the prior written consent of Apple DIRECT, INDIRECT, SPECIAL, INCIDENTAL, may constitute trademark infringement and OR CONSEQUENTIAL DAMAGES RESULTING FROM ANY DEFECT OR unfair competition in violation of federal INACCURACY IN THIS MANUAL, even if and state laws. advised of the possibility of such damages. THE WARRANTY AND REMEDIES SET No licenses, express or implied, are granted FORTH ABOVE ARE EXCLUSIVE AND IN with respect to any of the technology LIEU OF ALL OTHERS, ORAL OR WRITTEN, EXPRESS OR IMPLIED. No Apple dealer, agent, described in this document. Apple retains or employee is authorized to make any all intellectual property rights associated modification, extension, or addition to this with the technology described in this warranty. document. This document is intended to Some states do not allow the exclusion or limitation of implied warranties or liability for assist application developers to develop incidental or consequential damages, so the applications only for Apple-labeled or above limitation or exclusion may not apply to you. This warranty gives you specific legal Apple-licensed computers. rights, and you may also have other rights which Every effort has been made to ensure that vary from state to state. the information in this document is accurate. Apple is not responsible for typographical errors. Apple Computer, Inc. 1 Infinite Loop Cupertino, CA 95014 408-996-1010 Apple, the Apple logo, Carbon, Chicago, Geneva, Mac, Mac OS, Macintosh, Monaco, New York, and TrueType are trademarks of Apple Computer, Inc., registered in the United States and other countries. Adobe, Acrobat, and PostScript are trademarks or registered trademarks of Adobe Systems Incorporated in the U.S. and/or other countries. Helvetica, Palatino, and Times are trademarks of Heidelberger Contents Chapter 1 Introduction to Text Encodings and Conversions 7 Why You Need to Convert Text From One Encoding to Another 8 Deciding Which Encoding Converter to Use 9 The Text Encoding Converter 9 The Unicode Converter 11 Character Encoding and Other Concepts Fundamental to Text Encoding Conversion 11 Characters 12 Coded Character Sets 12 Presentation Forms 12 Character Encoding Schemes 13 Text Encoding Specifications 14 Unicode and the Complexities of Conversion 14 About Unicode 14 Round-Trip Fidelity 15 The Text Encoding Conversion Manager 18 About Earlier Releases 18 Checking the Version 18 Chapter 2 Character Encoding Concepts In-Depth 19 Terminology 19 Character Sets and Encoding Schemes 19 Characters, Glyphs, and Related Terms 20 Non-Unicode Character Encodings 22 General Character Set Structure 22 Simple Coded Character Sets 23 Packing Schemes for Multiple Character Sets 25 Code-Switching Schemes for Multiple Character Sets 26 Unicode 27 Character Set Features 29 Repertoire and Semantics 29 Combining and Conjoining Characters 29 Ordering Issues 31 Character Data in Programming Languages 33 3 © 2004 Apple Computer, Inc. All Rights Reserved. CONTENTS Chapter 3 Writing Custom Plug-Ins 35 Appendix A Character Encodings and Internet Names 59 Identifying Character Encodings on the Internet 59 Character Encodings Masquerading as Related Encodings 60 Character Encodings and Their Internet Names 60 Appendix B Mac OS Encoding Variants 67 Document Version History 71 Glossary 75 4 © 2004 Apple Computer, Inc. All Rights Reserved. Tables and Figures Chapter 1 Introduction to Text Encodings and Conversions 7 Figure 1-1 A possible conversion path used by the Text Encoding Converter 10 Chapter 2 Character Encoding Concepts In-Depth 19 Figure 2-1 Some glyph images for representing characters 21 Figure 2-2 Presentation forms 21 Figure 2-3 Comparison of 7-bit and 8-bit character set structures 23 Figure 2-4 Shift-JIS byte sequence 25 Figure 2-5 Unicode sequence expressed in UTF-16, UTF-8, and UTF-7 28 Figure 2-6 Some combining marks present in Unicode 30 Figure 2-7 Fraction slash and conjoining jamos 31 Figure 2-8 Implicit ordering 32 Figure 2-9 Character sequence and resulting display 32 Appendix A Character Encodings and Internet Names 59 Table A-1 Character encoding Internet names and availability in Mac OS 66 Appendix B Mac OS Encoding Variants 67 Table B-1 Mac OS Encoding Variants 69 5 © 2004 Apple Computer, Inc. All Rights Reserved. TABLES AND FIGURES 6 © 2004 Apple Computer, Inc. All Rights Reserved. CHAPTER 1 Introduction to Text Encodings and Conversions This chapter introduces the Text Encoding Conversion Manager. As a prelude, it explains why text encoding conversion is necessary. Then it describes the Text Encoding Conversion Manager's two main componentsÐthe Text Encoding Converter and the Unicode ConverterÐsuggesting why you should choose one over the other for your conversion processes. The remainder of the chapter explores some of the terms and concepts that pervade text encoding and the process of converting from one encoding to another, including ■ Characters, codes, coded character sets, and character encoding schemes ■ Text representation and text elements ■ Text encoding specifications ■ Unicode, in the context of its emergence as a solution to text encoding complexities ■ Round-trip fidelity, strict and loose mapping, Corporate Use Zone mappings, and fallback mappings Finally, the chapter highlights the Text Encoding Conversion Manager package contents and gives a terse history of its past releases. You should read this chapter if you are developing ■ Internet-savvy applications, such as web browsers or e-mail applications. ■ Applications that transfer text across platforms. ■ Applications based in Unicode, such as a word processor or file system that operates in Unicode. You can find descriptions of the basic text types for specifying text encodings and other aspects of conversion, the Text Encoding Converter, and the Unicode Converter in the following reference documents: Inside Mac OS X: Text Encoding Conversion Manager Reference http://developer.apple.com/documentation/Carbon/Reference/Text_Encodin_sion_Manager/index.html Inside Mac OS X: Unicode Utilities Reference http://developer.apple.com/documentation/Carbon/Reference/Unicode_Utilities_Ref/index.html 7 © 2004 Apple Computer, Inc. All Rights Reserved. CHAPTER 1 Introduction to Text Encodings and Conversions The reference documents are meant to be used as you develop your applications. You can consult the descriptions of data structures and functions to gain a high-level understanding of how to use the converters. For general information about how the Mac OS handles text, see Inside Mac OS X: Handling Unicode Text With MLTE, available from: http://developer.apple.com/documentation/Carbon/Conceptual/HandlingUnicodeText_MLTE/index.html Why You Need to Convert Text From One Encoding to Another This section explains in broad terms why you need to convert text from one encoding used to represent the text to another, and uses terminology fundamental to the text encoding conversion process. These terms and the concepts they represent are explored in depth later in ªCharacter Encoding and Other Concepts Fundamental to Text Encoding Conversionº (page 11) and in ªCharacter Encoding Concepts In-Depthº (page 19). Central to any discussion of text encoding and text encoding conversion is the concept of a character, which is an abstract unit of text context. Characters are often identified with or confused with one or more of the following concepts, but it is important to keep the notion of an abstract character separate from these concepts: ■ A graphic representation corresponding to a character (this graphic representation is what most people think of as the character) ■ A key or key sequence used to input a character ■ A number or number sequence used in a computer system to represent a character In this document we are concerned primarily with abstract characters and with their numeric representation in a computer system. In order to represent textual characters in a file or in a computer's memory, some sort of mapping must be used to assign numeric values

Load more