
Session: B02 DB2 and the Tower of Babel: Unicode to the Rescue

Jim Dee BMC Software Inc.

13 October 2008 • 13:30 – 14:30 Platform: DB2 for z/OS

UTF-8, UTF-16, code pages, and so on - Unicode can be a daunting subject! There is much interest in supporting it - many of our businesses are expanding globally, and DBA’s and application developers will need to support Unicode data stored in DB2 tables. The topic is easier than it looks, and this presentation will help you prepare to migrate your mission critical data to Unicode.

1 Key Points

• Understand Unicode concepts - what it is, and how different languages and symbols are represented.

• Explain common Unicode terms like code point, plane, and surrogate.

• Understand how Unicode is implemented in DB2 and what the code pages mean.

• Understand the implications of Unicode for application programs and utilities.

• Discuss migration considerations.

2

• Introduction – What is Unicode?
• EBCDIC and ASCII code pages
• Unicode encoding schemes
• Unicode in DB2 – UTF-8 and UTF-16, code pages, catalog
• Migration to Unicode – application programs, utilities, data conversion

2 Tower of Babel

"The Tower of Babel," oil painting by Pieter Bruegel the Elder, 1563 (credit: Courtesy of the Kunsthistorisches Museum, Vienna)

3

The Tower of Babel is a story from the book of Genesis. It is a classic tale of how an ambitious development project can be disrupted by lack of communication.

Sometimes, DB2 data management can seem like the tower. We can store data associated with a particular language in DB2 tables by defining the correct code page for a DB2 subsystem, but how do we combine data representing different languages? There are so many languages!

Unicode is part of the solution, but it too seems confusing. Should I use UTF-8 or UTF-16? If Unicode is universal, why do I need to define code pages in DSNDECP? How do I convert all my code pages to Unicode?

3 Historical Background

• Paleontology – EBCDIC and ASCII

• Archaeology – The Universal Character Set

• Ancient History – UCS-4, UTF-16, UTF-8

4

To understand Unicode, we have to look at its predecessors and some of the problems Unicode is designed to correct. So we will go back to the 1960’s and look at EBCDIC and ASCII, and some of the problems caused by their implementations.

Then we will start looking at Unicode and ISO 10646, and explain the concept behind the Universal Character Set.

This part of the presentation will end with a discussion of the encoding schemes of Unicode, and some other terminology.

4 EBCDIC

• Announced with S/360 in 1964, for peripheral (primarily punched card) support

• Still necessary for I/O to/from System z

• 8 bits allow 256 characters – 164 used

• Upper/lower case alphabetics, digits, punctuation, control characters for I/O devices

5

EBCDIC (Extended Binary Coded Decimal Interchange Code) is the text encoding scheme most of us are familiar with. It was announced with the System 360 in 1964! Peripheral devices today for system/z are still constrained by the requirements of an encoding scheme originally designed for punched cards.

EBCDIC started as a byte oriented encoding. Each byte represents one character, which limits the range of characters to 256.

The basic problem with EBCDIC is that 256 characters are not enough to represent all the languages of the world, and in fact are not enough to represent any one of the Asian languages. We will see some of the problems caused by that restriction in a few slides.

5 EBCDIC

[Code chart: the generic EBCDIC layout, rows X’4’ through X’F’ by columns X’0’ through X’F’, showing only the code points common to all single byte EBCDIC code pages: space, basic punctuation, a–z, A–Z, and 0–9.]

6

This is a generic representation of EBCDIC, showing only the code points which are common to all the single byte EBCDIC code pages. Notice the number of “unused” code points. Actually, in each of the code pages which are variants of EBCDIC, these code points are used, but they can and do represent different characters. Also, the byte values X’00’ through X’3F’ are reserved for control characters and are not shown on the slide, for clarity.

6 ASCII

• Announced as a standard in 1963

• Became the text encoding for most computers which are not IBM mainframes

• 7 bits allow 128 characters – all 128 used • Upper/lower case alphabetics, digits, punctuation, control characters for I/O devices

7

ASCII (American Standard Code for Information Interchange) also dates from the 60’s. It was the result of an organized standards effort, and became the text encoding of choice for most of the Western world.

It, like EBCDIC, is a byte oriented encoding where each byte represents one character. Unlike EBCDIC, only 7 bits are used, so the number of characters is limited to 128. As with EBCDIC, this does not support all the languages of the world, or even a single one in the case of the Asian languages.

Like EBCDIC, many different code pages have evolved as extensions of ASCII, some extending to 8 bits and some to 16 bits (double byte).

7 ASCII

(first hex digit of each byte value at the left, second hex digit across the row)

2  sp ! " # $ % & ' ( ) * + , - . /
3  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4  @ A B C D E F G H I J K L M N O
5  P Q R S T U V W X Y Z [ \ ] ^ _
6  ` a b c d e f g h i j k l m n o
7  p q r s t u v w x y z { | } ~

8

This slide shows the original ASCII standard character set. Notice that the collating sequence is different from that of EBCDIC. Notice also that representing any language other than English, or any richer character set, will be very difficult.

8 Code Pages

• Problem with both EBCDIC and ASCII was that they were originally designed for writers of English

• What to do if you want “é“, “θ“, “€”, or a Chinese character?

• Remember those unused code points?

• So...code pages!

9

The standard in both cases (EBCDIC and ASCII) specified some values which are common to all the code pages. These common characters represent English and the Western European languages. Each code page tends to support English and one other group of languages. We will see a few EBCDIC examples in the next few slides.

9 EBCDIC

[Code chart: Code Page 37 (English, USA and Canada). The common EBCDIC code points, plus accented Western European letters, currency signs, and additional punctuation in the remaining positions.]

10

This is the code page most commonly used in the United States. It represents the characters used in the United States, Canada, and the United Kingdom (at least before the Euro symbol was needed). The shaded cells represent the byte values which represent the same characters in all the EBCDIC code pages. Notice that a subset of other European letters, punctuation characters, and other common symbols is included.

10 EBCDIC

[Code chart: Code Page 500 (International English). Nearly identical to Code Page 37, with a few punctuation characters at different code points.]

11

This EBCDIC code page is very close to code page 37, but there are differences. It is used primarily in Europe.
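Python’s standard library happens to ship codecs for both of these EBCDIC code pages (as "cp037" and "cp500"), so the differences are easy to demonstrate. A small sketch, not part of the original presentation:

```python
# Compare how code pages 37 and 500 encode a few characters.
# "A" and "1" are common code points; "[" and "]" are not.
for ch in "A1[]":
    b37 = ch.encode("cp037").hex().upper()
    b500 = ch.encode("cp500").hex().upper()
    note = "" if b37 == b500 else "  <-- differs"
    print(f"{ch}: cp037=X'{b37}'  cp500=X'{b500}'{note}")
```

Running this shows identical bytes for “A” (X’C1’) and “1” (X’F1’), but “[” is X’BA’ in code page 37 and X’4A’ in code page 500.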

11 EBCDIC

[Code chart: Code Page 870 (Eastern European). The common EBCDIC code points, with the remaining positions used for Eastern European letters and symbols.]

12

This is an EBCDIC code page used to support Eastern European languages.

12 EBCDIC

[Code chart: Code Page 837 (Simplified Chinese), partial view of the double byte plane whose first byte value is X’42’. The non-Chinese characters shown mirror the single byte EBCDIC layout.]

13

Code page 837 is one of the double byte code pages which support the Asian languages, in this case Simplified Chinese. The code page supports 713 non-Chinese characters (like the ones shown on this slide as an example), 3755 “level 1 Chinese” characters, 3008 “level 2 Chinese” characters, and 1880 user defined characters, for a total of 9,356 characters used out of a possible 65,536 values in 16 bits.

13 ASCII

[Code chart: Code Page 819 (Latin-1). The 7 bit ASCII layout in rows 2 through 7, extended with accented Latin letters and additional symbols in rows A through F.]

14

This is one of the extended ASCII code pages. The character set represented corresponds closely to EBCDIC code page 500, which we saw earlier. This code page also corresponds to ISO 8859-1.

14 What is Unicode?

• Standard encoding for character sets published by The Unicode Consortium (www.unicode.org)

• Release 1.0.0 published in October, 1991 (49,000 characters)

• Release 5.1.0 published in April, 2008 (over 100,000 characters)

• Goal is to provide unique encoding for every character in every language • Platform independent

15

Unicode started with an effort by the Unicode Consortium, a group of interested organizations, to come up with an encoding scheme for character data which would replace EBCDIC, ASCII, etc., and provide one common encoding scheme which could represent all characters used by all languages of the world. This coincided with a standards development effort by the International Organization for Standardization (ISO), leading to ISO 10646.

These two developments merged in the 1990’s and since Unicode version 3.0 have been the same.

The goal of Unicode is to provide a platform independent encoding for every conceivable character of every human language.

15 What is Unicode?

• Started with ISO 10646 standard

• First draft in 1990

• Goal was Universal Character Set

16

The Universal Character Set was the original goal of the standard development, and it led to UCS-4.

16 UCS-4 and UTF-32

• Represents each character as a 31 bit integer
  • Values U+0000 to U+7FFFFFFF

• Aligns with UTF-32
  • UTF-32 represents values from U+0000 to U+10FFFF

• 1-1 correspondence with Unicode “code points”
  • 1,114,112 possible values, 100,713 used

17

UCS-4 is a straightforward representation of each character as a 31 bit integer (a fullword on System z). So, over 2 billion values from 0 to X’7FFFFFFF’ can be represented.

UTF-32, which is part of the Unicode standard, is a subset of UCS-4. It uses 32 bits to represent each character, but only the Unicode code space from U+0000 to U+10FFFF is used. This allows for 1,114,112 possible values, of which 100,713 are assigned in Unicode 5.1.0.

UTF-32 is very simple. Each character is a fixed length, which makes finding a character in a text string or searching text for a specific character easy. Each value is the same as the Unicode code point. So, you may ask, why would we want to use any other representation?

The problem with UTF-32 is space. Using 4 bytes for each character when many characters can be represented with a single byte seems wasteful to DBA’s, among others.
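The fixed 4 byte width and the space cost are easy to see in any language with UTF-32 support. A quick Python illustration, not from the presentation:

```python
# "utf-32-be" is UTF-32, big endian, without a byte order mark,
# so each character encodes to exactly 4 bytes.
for ch in "A€":
    data = ch.encode("utf-32-be")
    print(f"{ch} -> U+{ord(ch):08X}, {len(data)} bytes")

# The space cost: the 7 character string "Unicode" takes 28 bytes.
print(len("Unicode".encode("utf-32-be")))  # 28
```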

17 UTF-32

Character   UTF-32 Value
A           U+00000041
€           U+000020AC
ﻶ           U+0000FEF6

18

This slide shows selected characters and their UTF-32 values.

We will discuss alternatives to representing each character with a 32 bit value, but first we need to look at the allocation of the Unicode code space.

18 How is Code Space Allocated?

[Diagram: the Unicode code space drawn as 17 stacked planes. Plane 0 spans U+0000 to U+FFFF, Plane 1 spans U+10000 to U+1FFFF, and the last plane runs from U+100000 to U+10FFFF.]

19

The Unicode code space (U+0000 through U+10FFFF) has been divided for convenience into 17 “planes”, each consisting of 256 groups of 256 characters each. Therefore, each plane could conceivably hold 65,536 characters. Only planes 0, 1, 2, 14, 15, and 16 are used. The rest are reserved for future usage.

19 More Plane Details

• Plane 0 (U+0000 to U+FFFF) is also known as the “Basic Multilingual Plane” (BMP)
  • 1st 256 code points follow ISO 8859-1 – first 128 follow ASCII
  • Almost all characters of modern languages are included
  • Latin, Greek, Cyrillic, and other scripts, symbols, punctuation, Han ideographs

20

Plane 0 is known as the Basic Multilingual Plane, or BMP; it contains all the non-Asian characters in modern languages, all the commonly used punctuation and other symbols, and the commonly used characters for Chinese, Japanese, and Korean scripts.

An important point is that the first 128 code points correspond to the ASCII values we saw earlier. So, for example, code point U+00000031 (sometimes shortened to “U+0031” or “U+31”) represents the English character “1”, which is X’31’ in ASCII.
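The ASCII overlap can be checked mechanically. A short Python sketch:

```python
# U+0031 is the digit "1", and X'31' is also its ASCII encoding.
assert ord("1") == 0x31
assert "1".encode("ascii") == b"\x31"

# The correspondence holds for the entire 7 bit ASCII range.
assert all(bytes([cp]).decode("ascii") == chr(cp) for cp in range(128))
print("first 128 Unicode code points match ASCII")
```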

20 More Plane Details

• Plane 1 (U+10000 to U+1FFFF) is “Supplementary Multilingual Plane” (SMP)
  • Historic scripts, notational systems

• Plane 2 is “Supplementary Ideographic Plane”
  • Less commonly used ideographs for Asian languages

• Planes 3 through 13 are unassigned

• Plane 14 is “Supplementary Special Purpose Plane”

• Planes 15 and 16 are reserved for private use

21

Planes 1 and 2 include the language characters which are not found in the BMP. Notice that all commonly used characters in modern languages can be found in the BMP; this is an important point to which we will return later. Some Chinese, Japanese, and Korean ideographs are found in the SIP (plane 2).

Plane 14 (the “Supplementary Special Purpose Plane”) is used for format control characters for which there was not room in the BMP.

The private use characters are reserved for computer hardware or software vendors or businesses who can agree on a set of character representations to use. Note that 6400 private use characters are reserved in the BMP, and these are supplemented by 131,068 more in planes 15 and 16.

21 So why not use UTF-32?

Unicode

How many characters?

How many bytes?

22

UTF-32 has many advantages – it is simple, it is fixed length, it makes calculating lengths and offsets easy, and so on. Its disadvantage (to DBA’s and others) is storage use. The character string “Unicode” is 7 characters, for example, and DBA’s react badly to the idea of using 28 bytes to store this value when 7 could be used.

So we will look at alternative encodings of Unicode which use storage more efficiently.

22 UCS-2

• 16 bit mapping of BMP

• Supports only U+0000 to U+FFFF

• Historical interest only

23

UCS-2 is mentioned only to clarify UTF-16. UTF-16 is a superset of UCS-2 which has its advantages without limiting the character set to the BMP.

23 UTF-16

• “Unicode Transformation Format” (16 bits)

• Supports Unicode code space from U+0000 to U+10FFFF

• Code points in BMP are supported directly

• “Surrogate code points” U+D800 to U+DFFF are reserved

24

UTF-16, like UTF-8, can represent the entire code space of Unicode. It is a variable length encoding; either 2 bytes or 4 bytes are used.

First, any code point in the BMP is represented directly in a 16 bit value. So, all the characters used in Western modern languages can be represented in 2 bytes, and most of the commonly used Asian characters. Characters not in the BMP use 4 bytes, using the 2048 surrogate code points from U+D800 to U+DFFF. The use of the surrogate code points is explained in the next slide.

24 UTF-16

• Code points above U+FFFF are encoded into two 16 bit values
  • 1st is between U+D800 and U+DBFF
  • 2nd is between U+DC00 and U+DFFF

• Almost all useful characters represented in 2 bytes

• Some represented in 4

25

The algorithm for encoding characters outside the BMP is as follows:

1) Subtract U+10000 from the Unicode code point value, leaving a 20 bit value.
2) Split the 20 bit value into two 10 bit values.
3) Add the high order 10 bit value to U+D800. Notice that this will leave you with a value between D800 and DBFF. This value will become the first 16 bits of the UTF-16 encoding.
4) Add the low order 10 bit value to U+DC00, leaving a value between DC00 and DFFF. This will become the second 16 bits of the UTF-16 encoding.

Notice that this encoding is one to one and reversible. Converting from UCS-4 to UTF-16 or vice versa is straightforward. Notice also that looking at any two bytes in a UTF-16 string can tell you whether they represent a 2 byte character, the first half of a 4 byte character, or the second half of a 4 byte character.

For example, the UTF-16 encoding of “A” (U+0041) is 0041. The encoding of the Euro symbol, U+20AC, is 20AC, and the encoding for U+120C1, which is a cuneiform symbol, is D808 DCC1.
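The four steps above translate directly into code. A minimal Python sketch (the function name is my own):

```python
def utf16_encode_supplementary(cp: int) -> tuple[int, int]:
    """Encode a code point above U+FFFF as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                     # step 1: a 20 bit value
    high, low = v >> 10, v & 0x3FF       # step 2: two 10 bit halves
    return 0xD800 + high, 0xDC00 + low   # steps 3 and 4

# The cuneiform example from the text: U+120C1 -> D808 DCC1
h, l = utf16_encode_supplementary(0x120C1)
print(f"{h:04X} {l:04X}")  # D808 DCC1
```

Because the mapping is one to one and reversible, the result always agrees with what a UTF-16 codec produces for the same character.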

25 UTF-8

• Variable length mapping

• ASCII characters fit in 1 byte (U+00 to U+7F)

• Values from U+0080 to U+07FF fit in 2 bytes

• From U+0800 to U+FFFF fit in 3 bytes

• From U+10000 to U+10FFFF fit in 4 bytes

26

The algorithm to convert a Unicode code point to UTF-8 is based on bit mappings. ASCII characters map directly, and if the high order bit of any byte is on, we know it is part of a longer value. So, “A”, U+41, maps to 41. Two byte values map as follows: 00000aaa aabbbbbb maps to 110aaaaa 10bbbbbb, so “Δ” (U+0394) becomes CE94. Three byte values map as follows: aaaabbbb bbcccccc maps to 1110aaaa 10bbbbbb 10cccccc, so the Euro symbol (U+20AC) becomes E282AC. Four byte values map as follows: 000aaaaa bbbbcccc ccdddddd maps to 11110aaa 10aabbbb 10cccccc 10dddddd, so U+120C1 becomes F0928381.

Note that this mapping, like UTF-16, is one to one and reversible.

It is the most complicated of the encodings, but:

• ASCII characters are represented in 1 byte values
• Just about any character in any modern European language can be represented in 2 bytes.
• Any character in the BMP will fit in 3 bytes.

The bottom line is that, for European languages, UTF-8 will be significantly more compact than UTF-16. For Asian languages, UTF-16 will probably be more compact.
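The bit mappings above can be written out as a small function. A Python sketch (Python’s own str.encode("utf-8") does this natively, so this is purely illustrative):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 via the bit mappings in the text."""
    if cp <= 0x7F:        # ASCII maps directly to 1 byte
        return bytes([cp])
    if cp <= 0x7FF:       # 110aaaaa 10bbbbbb
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:      # 1110aaaa 10bbbbbb 10cccccc
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 11110aaa 10aabbbb 10cccccc 10dddddd
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

# The examples from the text: 41, CE94, E282AC, F0928381
for cp in (0x41, 0x394, 0x20AC, 0x120C1):
    print(f"U+{cp:X} -> {utf8_encode(cp).hex().upper()}")
```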

26 Unicode in DB2

• CHAR, VARCHAR, CLOB stored as UTF-8
  • Declared length is number of bytes
  • FOR SBCS DATA means CCSID 367 (ASCII)
  • FOR MIXED DATA is default

• GRAPHIC, VARGRAPHIC, DBCLOB stored as UTF-16
  • Declared length is number of double byte characters

27

There is much confusion about the storage of Unicode data in DB2. Really, it is very simple (at least conceptually). If a table is defined as UNICODE, any character in it is either UTF-8 or UTF-16. CHAR, VARCHAR, and CLOB columns are UTF-8; GRAPHIC, VARGRAPHIC, and DBCLOB are UTF-16. The confusion arises because of past usage in EBCDIC of GRAPHIC and MIXED columns. If you remember that Unicode data is stored in either UTF-8 or UTF-16, MIXED is completely irrelevant. Specifying “FOR SBCS DATA” for a UTF-8 column limits the allowed values to ASCII (code page 367).

Another source of confusion is the length of character columns. The UTF-8 columns are declared with a length or maximum length that describes the number of bytes. So, for example, if you want a CHAR column to hold up to 20 UTF-8 characters, and you anticipate values spanning the entire Unicode code space, you need to define the column as “CHAR(80)”. If, on the other hand, you know it will store only ASCII characters, the column can be “CHAR(20)”.

UTF-16 columns are declared with a length that describes the number of double byte characters (a Unicode character which maps to a 4 byte value is considered 2 characters for this purpose), so the number of bytes reserved for the column is twice the declared length. Again, if you anticipate values spanning the entire code space, you will need 4 bytes for each character, but if you know you will have values only from the BMP, you need only 2 bytes for each character.
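A quick way to build intuition for these sizing rules is to measure some sample values. A Python sketch with made up 20 character strings:

```python
# Three hypothetical 20 character column values.
samples = {
    "ASCII only": "ORDER NUMBER 1234567",
    "French":     "référence numéro 001",
    "Greek":      "αριθμός αναφοράς 001",
}
for label, s in samples.items():
    print(f"{label}: {len(s)} chars, "
          f"{len(s.encode('utf-8'))} bytes as UTF-8, "
          f"{len(s.encode('utf-16-be'))} bytes as UTF-16")
```

For the ASCII value, UTF-8 needs exactly 20 bytes; the Greek value needs 35. All three need 40 bytes as UTF-16, because every character here is in the BMP.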

27 Unicode in DB2

• CCSID EBCDIC | ASCII | UNICODE is now an option in CREATE DATABASE, TABLESPACE, or TABLE

• All tables in tablespace must be same encoding scheme

28

Again, Unicode is much simpler than EBCDIC or ASCII. Once you choose Unicode as the encoding scheme, each character column must be either UTF-8 or UTF-16. You make the choice based on storage requirements, the characteristics of your data, and the format expected by systems your DB2 data interacts with.

28 Code Page Conversion

• Conceptually, any EBCDIC or ASCII character can be converted to Unicode

• A randomly selected Unicode character may not convert to EBCDIC or ASCII
  • Substitution byte is used

• z/OS Conversion Services is used
  • Tailor your table and reduce z/OS overhead!

29

Code page conversion is the process of converting data in one code page to another. It is clear that any ASCII or EBCDIC character can be converted to Unicode, because the Unicode code space encompasses all the characters that are included in the other encoding schemes. However, as we have seen, it is easy to find Unicode code points which correspond to characters which cannot be found in any given code page. We will discuss in the next few slides why you might want to convert from EBCDIC in particular to Unicode, and vice versa.

When a character from one code page cannot be translated to any character in a requested target code page, a substitution byte can be used. Typically this is an unprintable character so the user can know that an untranslatable character has been found. The typical values for substitution bytes are X’3F’ for EBCDIC, X’1A’ for ASCII/UTF-8, and X’001A’ for UTF-16.

DB2 for z/OS calls z/OS Conversion Services to accomplish code page conversions to and from Unicode. The prebuilt DB2 conversion image is configured to support many common code page conversions, and it consumes about 36M of fixed page storage. You can rebuild it to support only the conversions you need, and probably make it much smaller. Details can be found in the DB2 Installation Guide and in SA22-7649, “Support for Unicode: Unicode Services”.
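The untranslatable-character problem is easy to simulate. Python raises an error rather than substituting (and its replacement character is “?”, not the X’3F’/X’1A’ values above), so the sketch below only demonstrates detecting untranslatable characters, not the z/OS substitution behavior; the function name is my own:

```python
def convertible(s: str, codepage: str) -> bool:
    """True if every character of s exists in the target code page."""
    try:
        s.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

print(convertible("Hello", "cp037"))  # True: plain letters are common code points
print(convertible("€", "cp037"))      # False: code page 37 predates the Euro sign
print(convertible("Δ", "cp875"))      # True: cp875 is the EBCDIC Greek code page
```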

29 DSNDECP Values

• 3 values for Unicode
  • USCCSID=367 (Unicode single byte code page) – really ASCII, FOR SBCS DATA
  • UMCCSID=1208 (Unicode mixed code page) – UTF-8, default for Unicode data
  • UGCCSID=1200 (Unicode graphic code page) – UTF-16, GRAPHIC columns

30

You have no choice about what values to include. As discussed earlier, the references to “MIXED” and “GRAPHIC” are there for historical reasons and can be confusing.

30 More DSNDECP Values

• MIXED=YES | NO
  • No effect on Unicode data
  • FOR MIXED DATA is default for Unicode even when MIXED=NO
  • This does affect EBCDIC and ASCII columns

• ENSCHEME=EBCDIC | ASCII | UNICODE
  • Default encoding scheme for CREATE of databases, tablespaces, tables, procedures, etc.

31

MIXED= YES/NO has absolutely no effect on Unicode columns.

If you want to encourage the use of Unicode, you can set it to the default for CREATE DATABASE, etc. with “ENSCHEME=UNICODE”.

31 More DSNDECP Values

• APPENSCH=EBCDIC | ASCII | UNICODE | ccsid

32

APPENSCH, in contrast to ENSCHEME, defines what encoding scheme you expect the outer world to use. This is where Unicode in DB2 for z/OS comes into head on contact with EBCDIC!

This value is what DB2 assumes incoming SQL values, including your host variables, are coded in, regardless of how the corresponding column is defined or what encoding scheme your package or plan has been bound with.

32 DB2 Catalog

• 17 tables in catalog converted during V8 ENFM

• All internal parsing in Unicode

• Names can now be unreadable in EBCDIC

33

The focus of this presentation is on user data in Unicode, so I do not have much to say about the catalog conversion. You will care if you have application programs which read or update the catalog, especially if they join catalog tables to user defined EBCDIC tables.

You can (with difficulty) define tables, columns, etc. to have Unicode names which are not translatable to EBCDIC. Don’t do it!

33 Migration to Unicode

• Application Programs

• Utilities

• Data Conversion

34

The next several slides will discuss aspects of migrating data which is probably in one or more EBCDIC code pages to Unicode. The three areas to think about are your application programs, utilities (primarily Unload, Load, and Reorg), and data conversion.

34 Your Applications

• What code page(s) are your source programs in?
  • APPENSCH (from DSNDECP) or CCSID precompiler option
  • CURRENT APPLICATION ENCODING SCHEME special register for dynamic SQL
  • ENCODING option of BIND
  • What matters are characters like “¢”, “¬”, “€”

• Don’t forget – the collating sequence changes!
  • space, digits, upper case, lower case in Unicode

35

Your source code stored on the mainframe is almost certainly EBCDIC! In DB2 V8 or DB2 9, if you set “APPENSCH=EBCDIC” in DSNDECP, DB2 will act in this regard as it always has, assuming your inputs are coded in the system EBCDIC code page. The CURRENT APPLICATION ENCODING SCHEME special register is the corresponding value for dynamic SQL. All your host variables and static SQL statements are assumed to be in this encoding scheme. You can specify ENCODING as a BIND option, but CCSID can only be used if it corresponds to the system EBCDIC code page. You need to make sure that the effective specification matches what you are storing after emulation, source management, etc. For instance, if you store your source in code page 500 and your APPENSCH value is 285, you may have a problem with SQL that includes the characters listed, or others which have different code points in different EBCDIC code pages.

As long as your code deals with Unicode values that can be translated into your EBCDIC code page, you should have no problem. You do need to remember that the collating sequence is different, so ORDER BY with a character column may return your data in a different order.
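The collating difference is easy to demonstrate with Python’s cp037 codec: sorting the same three values by their EBCDIC bytes reverses the order an ASCII/Unicode sort gives. A quick sketch:

```python
values = ["a", "A", "1"]

# Unicode/ASCII order: digits < upper case < lower case
print(sorted(values))                                   # ['1', 'A', 'a']

# EBCDIC order: lower case < upper case < digits
# (in cp037, 'a' is X'81', 'A' is X'C1', '1' is X'F1')
print(sorted(values, key=lambda s: s.encode("cp037")))  # ['a', 'A', '1']
```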

“DECLARE VARIABLE CCSID UNICODE” can be used to avoid conversion of a Unicode column value to EBCDIC, with substitution values. You can try a retrieval into an EBCDIC value first, check for SQL code +335, and then retrieve the same column declared as CCSID UNICODE. Of course, your application logic still has to do something sensible with the value.

35 Your Applications

• If your input and output is all in one code page (EBCDIC 870, for example) – no problem!

• General issue is storing data in Unicode which “belong” to more than one code page

• Let’s look at a hypothetical example

36

Assuming your external world is EBCDIC, you will have a problem only if you are converting to Unicode so you can store data (like Asian characters) which cannot be translated into your EBCDIC code page, or you are combining data stores (and possible applications) which use different EBCDIC code pages.

36 Global Business

Code page 1142 in Denmark
Code page 1147 in France
Code page 875 in Greece
Unicode in the data center

37

Our example is a European business which has a central data store and z/OS server, with peripherals (input and output) in France (code page 1147), Greece (code page 875), and Denmark (code page 1142). Remember that characters like “A” map to the same code point in each of these code pages, but other characters like the Euro symbol (X’FC’ in 875, X’5A’ in 1142, and X’9F’ in 1147), “@” (X’7C’, X’80’, and X’44’), and “[” (X’4A’, X’9E’, and X’9F’) do not.

The problem is that the business must code their SQL in some EBCDIC code page and it must access Unicode tables which can store values for all of these characters and others which are unique to each code page.

37 Example continued

• You want all internal processing to use Unicode
  • The point of going to Unicode
  • Avoids conversion issues in processing

• The user can tell you what code page is in use
  • Actually, you would probably use location

• You could try to deduce the code page from the data
  • Difficult and risky

38

First of all, keep all internal values in Unicode and do any processing except I/O using Unicode. This avoids conversion issues and keeps your applications simple. This is the point of going to Unicode in the first place!

This means that incoming data must be explicitly converted from an EBCDIC code page to Unicode, and outgoing data must be converted from Unicode to EBCDIC. In the second case, you must allow for the fact that some of the data will not be translatable to the target code page (Greek characters in France, for example).

The safest approach is either to use the location of the input or output to decide the code page being used, or to include code in the application to ask the user. Location should work.

38 Example continued

• Use any EBCDIC code page for your code, but use only common character subset

• Convert everything (even selection criteria) to Unicode upon entry
  • DECLARE VARIABLE and z/OS Conversion Services

• Convert everything to appropriate code page for output
  • DECLARE VARIABLE again
  • Handle substitution bytes and SQL warnings?
  • Consider collating sequence for output
  • Consider host variable lengths carefully

39

The first rule is to be careful to use only the common characters in your SQL, regardless of which code page you choose to use. If you have to hard code a character which does not translate into all code pages, it should be in SQL which is performing an internal function which is processing only Unicode values.

One possibility for input is to SET CURRENT APPLICATION ENCODING SCHEME to the source code page, and use dynamic SQL to handle incoming values. If you want to use static SQL, use DECLARE VARIABLE for three copies of each host variable, and set and use the appropriate variable for the input code page.

39 Utilities - Unload

• Use only EBCDIC for PUNCHDDN (human readable)

• UNLDDN default is encoding scheme of source
  • Can specify UNICODE or EBCDIC

• DELIMITED with UNICODE means UTF-8
  • Must use X’value’ for COLDEL and CHARDEL

• STRIP applied after conversion

• WHEN can use EBCDIC values or X’value’

40

Unload works transparently if you are unloading the entire contents of a Unicode table for loading into another Unicode table. It can be confusing if you are using Unload and Load to convert from Unicode to EBCDIC or vice versa.

Selection (the “WHEN” clause) can use an EBCDIC value, which will get translated to Unicode if the source tablespace is Unicode, or you can specify X’value’ for untranslatable values. When X’value’ is used, no translation is done.

40 Bibliography

• www.unicode.org

• Redbook “DB2 UDB for z/OS Version 8: Everything You Ever Wanted to Know, ... and More”

• SC18-9854, “DB2 Version 9.1 for z/OS SQL Reference”

• SES1-2950, “DB2 Version 9.1 for z/OS Utility Guide and Reference”

• SC22-7649, “z/OS Support for Unicode: Unicode Services”

41

41 Session #B02 DB2 and the Tower of Babel: Unicode to the Rescue

Jim Dee BMC Software Inc. [email protected]

42
