A Look at Unicode and COBOL by Sasirekha Cota

A Look at Unicode and COBOL By Sasirekha Cota Some IT experts have predicted that COBOL would die out, but some new applications are being built in COBOL.Although Unicode is the default in Java, it is now entering the COBOL world only.This article defines Unicode, explains what COBOL offers in terms of Unicode support, factors to consider before using Unicode, and the benefits of using it. OVERVIEW acters to numbers are done using different With documents getting transferred elec- encoding systems. tronically rather than the printed media across Industry pundits have been predicting the The 7-bit ASCII was not sufficient to rep- countries that use different alphabets, the end of COBOL for the past two decades, resent all the characters used by many of the problems like accented Latin characters get- whereas a recent survey shows that there are world’s languages. Multiple code pages are ting replaced by Greek characters arise. between 180 billion and 200 billion lines of created to meet this challenge. DBCS are Similar problems arise when documents are COBOL code in use worldwide. Before you used for handling languages like Chinese, moved across operating systems. conclude that COBOL is not the language of Japanese and Korean with thousands of Unicode that assigns a unique number to choice for new development, note that an esti- ideographs. DBCSs are encoded in such a each character in each of the major languages mated 15% of all new applications are being way as to allow double-byte character of the world is the solution to this problem built in COBOL. encodings to be mixed with single-byte created by the assortment of 8-bit fonts limit- COBOL has been undergoing a silent evolu- character encodings. ed to representing 256 unique characters. tion, and the latest IBM Enterprise COBOL for The basic issue in this approach is that two z/OS brings new features that help integrate encodings can use the same number for two WHAT IS UNICODE? COBOL business processes and Web-oriented different characters, or use different numbers business processes. IBM extensions1 range for the same character. The computer has to Unicode provides a unique number for from minor relaxation of rules to major support multiple encoding schemes, and the every character, no matter what the platform, capabilities, such as XML support, Unicode data always ran the risk of getting corrupted no matter what the program, no matter what support, object-oriented COBOL for Java when passed between computers. the language. interoperability, and Double-Byte Character Unicode is the Universal-encoding stan- Sets (DBCS) character handling. NEED FOR UNICODE dard for (eventually) all the characters and Though Unicode is the default in Java, it is symbols used in all the languages of the entering the COBOL world now only. This Of the maximum of 256 characters in each world, past and present. It has merged with article describes what COBOL offers in terms code page set, the first 128 characters include and is compatible with international stan- of Unicode support. For the benefit of hard- English alphabets, punctuation marks and dard ISO/IEC 10646 (also known as the core COBOL programmers, let us first numbers. In the English-speaking world, the Universal Character Set, or UCS, for short). establish what Unicode is and what problems second set of 128 characters comprises more Unicode is a registered trademark of it is expected to solve. punctuation marks, some currency symbols Unicode, Inc., and is in development by the (such as £ and ¥), and a lot of accented letters Unicode Consortium. BEFORE UNICODE (such as á, ç, è, ñ, ô and ü). In countries like Unicode is required by standards like Egypt or Greece that use a different alphabet, XML, Java, LDAP, CORBA, and WML. Computers deal with numbers ONLY, and the second set of 128 is used for characters Unicode enables a single software product all the alphabets and other characters are from the Arabic, Greek, Hebrew, Cyrillic or or a single Web site to be used across multi- stored as numbers. The mapping of the char- Thai alphabets. ple platforms, languages and countries. It ©2004 Technical Enterprises, Inc. Reproduction of this document without permission is prohibited. Technical Support | March 2004 allows data to be transported through many different systems with- FIGURE 1: THE FIGURATIVE CONSTANT VALUES IF THE CONTEXT IS A out risking corruption. NATIONAL CHARACTER DESIGN OF UNICODE Figurative Constant National Character Value ZERO/ZEROS/ZEROES NX’0030’ The requirement for a character set that is universal, efficient, uni- SPACE/SPACES NX’0020’ form and unambiguous forms the basis for the design of the Unicode QUOTE/QUOTES NX’0027’ HIGH-VALUE / LOW-VALUE <Doesn’t Apply> Standard that assigns each character a unique numeric value and name. Each character is assigned a single number called a code point and in FIGURE 2: EXAMPLE OF NATIONAL-OF FUNCTION text form prefixed by “U.” For example, character “A” is represented by code point U+0041 (hex number 0041 and decimal equivalent 65) and is assigned the character name “LATIN CAPITAL LETTER A.” 01 Japanese-MBCS-DATA PIC X(20). 01 National-data PIC N(10) Usage National. A single code is used for representing characters or ideographs com- mon to multiple languages avoiding duplicate encoding of the characters. Move Function NATIONAL-OF(Japanese-MBCS-data, 1399) To The ASCII encoding forms the first half of ISO-8859-1 (Latin1) and National-data forms the first page Unicode, making it easier to apply Unicode to existing software and data. FIGURE 3: EXAMPLE OF DISPLAY-OF FUNCTION The Unicode Standard has three encoding forms with 8, 16 or 32 bits per code unit. The data can be transformed between these encoding Move Function DISPLAY-OF(National-data,1399) To Japanese- forms efficiently without any loss of information. MBCS-data CHARACTERS IN COBOL FIGURE 4: EXAMPLE TO CONVERT FROM EBCDIC TO ASCII The most basic and indivisible unit of the COBOL language is the 77 EBCDIC-CCSID PIC 9(4) BINARY VALUE 1140. character. The basic character set is extended with the IBM Double- 77 ASCII-CCSID PIC 9(4) BINARY VALUE 819. Byte Character Set (DBCS). DBCS characters occupy two adjacent 77 Input-EBCDIC PIC X(80). 77 ASCII-Output PIC X(80). bytes to represent one character. A character-string containing only DBCS characters is called a DBCS character-string. Move Function In IBM Enterprise COBOL, support for Unicode is provided by Display-of ( Function National-of (Input-EBCDIC EBCDIC- CCSID),ASCII-CCSID) NATIONAL data type. This allows the run-time character set that is to ASCII-output processed to include alphanumeric characters, DBCS characters and national characters. of a national literal is 80 character positions, excluding the opening and UNICODE IN COBOL closing delimiters. For example, define as: When we say Unicode support by Enterprise COBOL, the following 01 Data-in-Unicode PIC N(10) USAGE NATIONAL. are the areas to be understood: National literals in hexadecimal format have NX” or NX’ as the 1. Unicode data represented as National Data items and literals opening delimiter. The literal can include hexadecimal digits in the 2. Intrinsic functions DISPLAY-OF and NATIONAL-OF for character range ‘0’ to ‘9’; ‘a’ - f’; and ‘A’ to ‘F’. Each group of four hexadeci- conversions to and from Unicode mal digits represents a single national character and must represent a 2. Additional compiler options NSYMBOL and CODEPAGE in the valid code point in UTF-16. Hence, the length must be a multiple of Unicode context. four and can be a maximum of 320 digits. In case of national data items, the LENGTH function returns the Unicode capabilities are used implicitly by COBOL programs number of character positions, rather than the actual bytes occupied. that use object-oriented syntax for Java interoperability. Installing For example, LENGTH(Data-in-Unicode) will return a value of 10 and configuring Unicode conversion services is the prerequisite for though it occupies 20 bytes. compiling COBOL programs with National Characters or object- oriented syntax. For specific details on setting up and activating Compiler Options—NSYMBOL and CODEPAGE Unicode conversion services, refer to the Enterprise COBOL When the NSYMBOL(NATIONAL) compiler option is in effect, Customization Guide. USAGE NATIONAL is assumed and just the opening delimiter N” or N’ identifies a national literal. For example, N’AAABBBCCC’ is a National Data Items and Literals national literal in basic format. COBOL uses UTF-16 to represent national characters. This default The CODEPAGE compiler option lets you specify the code page encoding for Unicode occupies 2 bytes of storage and offers the best used for both data items and literals in the program. To support compromise between processing and storage. Unicode, IBM has included new Coded Character Set IDs (CCSID) in To define national data items holding Unicode strings, use PICTURE the range of 1200 to 1208. UTF-16 is supported by CCSID 1200, in clause with symbol “N” and USAGE NATIONAL. The maximum length big-endian format, and the CCSID for UTF-8 is 1208. Technical Support | March 2004 ©2004 Technical Enterprises, Inc. Reproduction of this document without permission is prohibited. Usage of National Literals Internationalization is the essential area where the industry is converg- National literals in the source program are converted to UTF-16 for ing on Unicode. Most significant application systems and third-party use at run time. products with international versions already support Unicode or When an alphanumeric item is moved to a national item, it is implicit- they are moving towards it. The important advantage they offer is ly converted to the UTF-16 Unicode. You can use this implicit conversion the option of having a single source (where multiple versions are achieved while moving to convert alphanumeric, DBCS and integer data earlier maintained for US, Chinese, Japanese).

Load more