Character Data Representation Architecture (CDRA)
Total Page:16
File Type:pdf, Size:1020Kb
Character Data Representation Architecture Character Data Representation Architecture (CDRA) is an IBM architecture which defines a set of identifiers, resources and APIs for identifying character data and maintaining data integrity during data conversion processes. The reference publication provides the reader with the framework for the identifiers, descriptions of the supporting resources and APIs, as well as methods for data conversions. CDRA is applicable to country specific data encodings such as the ISO-8859 series as well as large character sets such as Unicode. Edition Notice: Updated in 2018, this document is a revised, on-line version of the original IBM publication. This version of the document contains many typographical and editorial corrections. A section on the Unicode Encoding Structure has been added in Appendix A. Also, Appendix K has been added to document the CDRA identifiers defined in support of Unicode. Likewise, Appendix J, has been updated with information on how to obtain CDRA conversion table resources on-line. Appendix L was added containing the EBCDIC control code definitions. Overview • Chapter 1. Introduction • Chapter 2. Architecture Strategy Architecture Definition • Chapter 3. CDRA Identifiers • Chapter 4. Services • Chapter 5. CDRA Interface Definitions • Chapter 6. Difference Management • Chapter 7. CDRA Resources and Their Management Appendices • Appendix A. Encoding Schemes • Appendix B. Conversion Methods • Appendix C. CCSID Repository • Appendix D. Platform Support of CDRA • Appendix E. Graphic Character Identification Copyright IBM Corp. 1990, 1995, 2018 • Appendix F. Character Sets and Code Pages • Appendix G. Control Character Mappings • Appendix H. CDRA and IBM i (Formerly OS/400) • Appendix I. DFSMS/MVS Considerations • Appendix J. CDRA Conversion Resources • Appendix K. CDRA and Unicode • Appendix L. EBCDIC control character definitions Resources • Glossary Copyright IBM Corp. 1990, 1995, 2018 Chapter 1. Introduction This chapter introduces readers to the Character Data Representation Architecture. The architecture objectives, challenges, coverage and concepts are presented to form the basis for understanding the following chapters. Definition of Character Data Representation Architecture Character Data Representation Architecture (CDRA) is an IBM* architecture that defines a set of identifiers, services, supporting resources, and conventions to achieve consistent representation, processing, and interchange of graphic character data in data processing environments. Objectives The overall objective of CDRA is to define a method of assigning and preserving the meaning and rendering of coded graphic characters through various stages of processing and interchange. The Objectives intro paragraph is followed by these paragraphs that you have already: • Define the necessary supporting services (such as tagging) to allow consistent support of various coded graphic character sets, both within and across environments • Select a minimum number of coded graphic character sets that satisfies most of today's text and data-processing applications and provide the necessary definitions for them so that they can be consistently supported in all applications, services, and devices within and across different environments Data integrity challenges Character Data Representation Architecture is used by products and systems to address the following data integrity challenges: • Proliferation of Different Character Codes The primary problem in handling coded character sets is the variety of sets of characters and encoding schemes used to represent them. Technology has increased the variety of applications using computers, and the coded character support for these applications has been provided without an overall strategy. • Graphic Character Data Recognition The abilities to properly distinguish graphic character data in a universal manner and to attach a tag to the data are available only in some specific architected environments. The available architected methods are often inconsistent, have lagged behind worldwide requirements, and are constrained by the supporting products. • Inconsistent or Incomplete Set of Identifiers Few applications have consistent character support. Operating system environments rarely provide the necessary services to identify coded character data, leaving this responsibility to the applications. • Use of Absolute Values (Hard-coding) Many applications were designed to operate in specific environments with specific terminal characteristics. The internal representations of frequently used characters have been coded using absolute values. Character data misinterpretation occurs when environment changes are made and the initial character processing functions are no longer valid. • Non-tagged Data Traditional data processing environments were closed systems, and the coded character support was primarily governed by the device character handling capabilities. Data was rarely tagged. The most visible end-user impact of all these concerns is in data interchange within and across system environments. For example: • The set of graphic characters supported on the IBM Personal Computer (PC) cannot be fully processed in non-PC environments. (Character set mismatch.) • When a "dollar symbol" is sent in text from a U.S. mainframe computer to a U.K. mainframe computer, it often appears as a "pound sterling symbol" to the U.K. user. (Conversion based on byte integrity.) • When a "lowercase a" is sent to a Katakana terminal in Japan, it appears as a "Katakana character" or gets irreversibly converted to an "uppercase A". • Application and device code page differences may lead to users entering characters from the keyboard that are different from those specified by the application. For example, in Switzerland, the programmer must key "y-acute capital" (Ý) and "dieresis" ( ¨ ) instead of "left square bracket" ( [ ) and "right square bracket" ( ] ), even though the bracket symbols are supported and are engraved on the Swiss keyboard. • The variety of existing code point conversion tables produces inconsistent and often unpredictable results between different environments. See Figure 1. Figure 1. Character data platform domains A character data domain may be described as an environment in which all character data has the same coded representation. This can be shown in a broad sense with respect to each category of system; midrange, mainframe, workstation and personal. Data domains may be the same within a system, or may differ within a system, but typically the view is one of a character data domain per system category. There are well known examples of the problems encountered as character data moves between data domains, leading to character data misrepresentation. Businesses may spend a significant portion of their information technology budgets circumventing, repairing, and educating to resolve the data integrity problems. Coverage of CDRA Character Data Representation Architecture defines: • An identification or tagging system to uniquely and reliably identify the representation of graphic character data • A set of portable Application Programming Interfaces (APIs) • A set of resources in support of the tags and services • A set of conventions on the use of the tags and services • A strategy for coded character set convergence. This coverage is depicted in Figure 2. Figure 2. Components of CDRA. CDRA components are categorized as the data identification mechanism, functions, resources and processing guidelines. Character data focus Data can be classified in many ways, such as character data, byte strings, integer numbers, or floating-point numbers. Character data is further classifiable into control character data and graphic character data. Control characters include, for example, Horizontal Tab and Line Feed, which perform specific functions. Graphic characters include uppercase and lowercase letters (with and without accent marks), numeric digits 0 to 9, ideographs, and other symbols. Graphic character data streams can include embedded code extension controls, such as Shift-Out or Shift-In, used in the interpretation of the data following the controls. Figure 3 shows an example of these classifications. CDRA deals with character data; primarily with graphic character data, and to a nominal extent with control character data. Figure 3. Types of data in a string. Various types of data may be contained in a data string. CDRA focuses on the coded graphic character data. Support for control functions Control functions, as defined by either single control characters or sequences of code points, can appear intermixed with graphic character data. From the CDRA point of view, there are two categories of control functions: • Code extension These functions modify the interpretation of subsequent code points representing graphic characters. Examples are: Shift-Out (SO), Shift-In (SI), Single-Shift 2 (SS2). • All other control functions Applications or architectures are responsible for handling specific control functions. CDRA provides an interface to query a set of control character encodings and uses the code point assignments for SPACE and SUB in its difference management functions. CDRA conversion methods and functions support the concept of string types to handle space-padded and null-terminated strings. All other aspects of control functions are outside the scope of CDRA. Architecture concepts Tagging Tagging is the primary method to identify the meaning and rendering of coded graphic characters.