Unicode to the Rescue
Session: B02
DB2 and the Tower of Babel: Unicode to the Rescue
Jim Dee
BMC Software Inc.
13 October 2008 • 13:30 – 14:30
Platform: DB2 for z/OS

UTF-8, UTF-16, code pages, and so on - Unicode can be a daunting subject! There is much interest in supporting it: many of our businesses are expanding globally, and DBAs and application developers will need to support Unicode data stored in DB2 tables. The topic is easier than it looks, and this presentation will help you prepare to migrate your mission-critical data to Unicode.

Key Points
• Understand Unicode concepts: what it is, and how different languages and symbols are represented.
• Explain common Unicode terms like code point, plane, and surrogate.
• Understand how Unicode is implemented in DB2 and what the code pages mean.
• Understand the implications of Unicode for application programs and utilities.
• Discuss data conversion considerations.

Introduction
What is Unicode?
EBCDIC and ASCII code pages
Unicode encoding schemes
Unicode in DB2
UTF-8 and UTF-16
Code pages
Catalog
Migration to Unicode
Application programs
Utilities
Data conversion

Tower of Babel
"The Tower of Babel," oil painting by Pieter Bruegel the Elder, 1563 (credit: Courtesy of the Kunsthistorisches Museum, Vienna)

The Tower of Babel is a story from the book of Genesis. It is a classic tale of how an ambitious development project can be disrupted by lack of communication. Sometimes, DB2 data management can seem like the tower. We can store data associated with a particular language in DB2 tables by defining the correct code page for a DB2 subsystem, but how do we combine data representing different languages? There are so many languages! Unicode is part of the solution, but it too seems confusing. Should I use UTF-8 or UTF-16? If Unicode is unitary, why do I need to define code pages in DSNDECP? How do I convert all my code pages to Unicode?

Historical Background
• Paleontology – EBCDIC and ASCII
• Archaeology – Universal Character Set
• Ancient History – UCS-4, UTF-16, UTF-8

To understand Unicode, we have to look at its predecessors and some of the problems Unicode is designed to correct. So we will go back to the 1960s and look at EBCDIC and ASCII, and some of the problems caused by their implementations. Then we will start looking at Unicode and ISO 10646, and explain the concept behind the Universal Character Set. This part of the presentation will end with a discussion of the encoding schemes of Unicode, and some other terminology.

EBCDIC
• Announced with S/360 in 1964, for peripheral (primarily punched card) support
• Still necessary for I/O to/from System z
• 8 bits allow 256 characters – 164 used
• Upper/lower case alphabetics, digits, punctuation, control characters for I/O devices

EBCDIC (Extended Binary Coded Decimal Interchange Code) is the text encoding scheme most of us are familiar with. It was announced with the System/360 in 1964! Peripheral devices for System z today are still constrained by the requirements of an encoding scheme originally designed for punched cards. EBCDIC started as a byte-oriented encoding. Each byte represents one character, which limits the number of characters to 256. The basic problem with EBCDIC is that 256 characters are not enough to represent all the languages of the world, and in fact are not enough to represent any one of the Asian languages. We will see some of the problems caused by that restriction in a few slides.

EBCDIC (code points common to all single byte code pages)
X’40’          sp
X’41’          rsp
X’4B’–X’4E’    . < ( +
X’50’          &
X’5C’–X’5E’    * ) ;
X’60’          -
X’6A’–X’6F’    ¦ , % _ > ?
X’7A’–X’7F’    : # @ ' = "
X’81’–X’89’    a b c d e f g h i
X’91’–X’99’    j k l m n o p q r
X’A2’–X’A9’    s t u v w x y z
X’C1’–X’C9’    A B C D E F G H I
X’CA’          -
X’D1’–X’D9’    J K L M N O P Q R
X’E2’–X’E9’    S T U V W X Y Z
X’F0’–X’F9’    0 1 2 3 4 5 6 7 8 9

This is a generic representation of EBCDIC, showing only the code points which are common to all the single byte EBCDIC code pages. Notice the number of “unused” code points.
Actually, in each of the code pages that are variants of EBCDIC, these code points are used, but they can and do represent different characters. Also, the byte values X’00’ through X’3F’ are reserved for control characters and are not shown on the slide, for clarity.

ASCII
• Announced as a standard in 1963
• Became text encoding for most computers which are not IBM mainframes
• 7 bits allow 128 characters – all 128 used
• Upper/lower case alphabetics, digits, punctuation, control characters for I/O devices

ASCII (American Standard Code for Information Interchange) also dates from the 1960s. It was the result of an organized standards effort, and became the text encoding of choice for most of the Western world. Like EBCDIC, it is a byte-oriented encoding where each byte represents one character. Unlike EBCDIC, only 7 bits are used, so the number of characters is limited to 128. As with EBCDIC, this is not enough to support all the languages of the world, and not enough for even one of the Asian languages. Like EBCDIC, many different code pages have evolved as extensions of ASCII, some extending to 8 bits and some to 16 bits (double byte).

ASCII (rows 2–7; row is the high-order hex digit, column the low-order hex digit)
    0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
2   sp !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6   `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~

This slide shows the original ASCII standard character set. Notice that the collating sequence is different from that of EBCDIC. Notice also that representing any language other than English, or any richer symbol set, will be very difficult.

Code Pages
• Problem with both EBCDIC and ASCII was that they were originally designed by English writers
• What to do if you want “é”, “θ”, “€”, or a Chinese character?
• Remember those unused code points?
• So...code pages!

The standard in both cases (EBCDIC and ASCII) specified some values which are common to all the code pages. These common characters represent the English and Western European alphabet.
Each code page tends to support English and one other group of languages. We will see a few EBCDIC examples in the next few slides.

EBCDIC Code Page 37 (English, USA and Canada)
[Chart: 16 × 16 grid of code points, rows and columns labeled 0–F]

This is the code page most commonly used in the United States. It represents the English alphabet used in the United States, Canada, and the United Kingdom (at least before the Euro symbol was needed). The shaded cells mark the byte values that represent the same characters in all the EBCDIC code pages. Notice that a subset of other European alphabets, punctuation characters, and other common symbols is included.

EBCDIC Code Page 500 (International English)
[Chart: 16 × 16 grid of code points, rows and columns labeled 0–F]

This EBCDIC code page is very close to code page 37, but there are differences. It is used primarily in Europe.

EBCDIC Code Page 870 (Eastern European)
[Chart: 16 × 16 grid of code points, rows and columns labeled 0–F]

This is an EBCDIC code page used to support Eastern European languages.

EBCDIC Code Page 837 (Simplified Chinese) [partial]
[Chart: 16 × 16 grid of second-byte values, rows and columns labeled 0–F; 1st byte value is X’42’]

Code page 837 is one of the double byte code pages which support the Asian languages, in this case Simplified Chinese. The code page supports 713 non-Chinese characters (like the ones shown on this slide as an example), 3,755 “level 1 Chinese” characters, 3,008 “level 2 Chinese” characters, and 1,880 user-defined characters, for a total of 9,356 characters used out of a possible 65,536 values in 16 bits.

ASCII Code Page 819 (Latin-1)
    0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
2   sp !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6   `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
A      ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬  -  ®  ¯
B   °  ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½  ¾  ¿
C   À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï
D   Ð  Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý  Þ  ß
E   à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï
F   ð  ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý  þ  ÿ

This is one of the extended ASCII code pages. The character set represented corresponds closely to EBCDIC code page 500, which we saw earlier. This code page also corresponds to ISO 8859-1.

What is Unicode?
• Standard encoding for character sets published by The Unicode Consortium (www.unicode.org)
• Release 1.0.0 published in October, 1991 (49,000 characters)
• Release 5.1.0 published in April, 2008 (over 100,000 characters)
• Goal is to provide unique encoding for every character in every language
• Platform independent

Unicode started with an effort by the Unicode Consortium, a group of interested organizations, to come up with an encoding scheme for computer character data which would replace EBCDIC, ASCII, etc., and provide one common encoding scheme which could represent all characters used by all languages of the world.
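
The code page problem described in the historical slides can be tried out in a few lines. The following is only an illustrative sketch, not part of the original presentation: it assumes that Python's built-in codecs cp037, cp500, and latin_1 are acceptable stand-ins for EBCDIC code pages 37 and 500 and for code page 819, and the helper names decode_byte and collate are made up for this example. It shows that one byte value means different characters in different code pages, that one character has different byte values but a single Unicode code point, and that the binary collating sequence differs between EBCDIC and ASCII.

```python
# Illustrative sketch only. Assumption: Python's built-in codecs stand in for
# the code pages discussed above (cp037 and cp500 are EBCDIC code pages 37
# and 500; latin_1 is ISO 8859-1, i.e. code page 819).

CODECS = ("cp037", "cp500", "latin_1")


def decode_byte(value, codec):
    """Decode a single byte value under a single-byte code page (hypothetical helper)."""
    return bytes([value]).decode(codec)


def collate(items, codec):
    """Sort strings by their encoded byte values, i.e. binary collation (hypothetical helper)."""
    return sorted(items, key=lambda s: s.encode(codec))


# One byte value, three different characters.
same_byte = {codec: decode_byte(0x4A, codec) for codec in CODECS}

# One character, different byte values, but a single Unicode code point.
e_acute = "\u00e9"  # é
encodings = {codec: e_acute.encode(codec).hex() for codec in CODECS}
code_point = "U+{:04X}".format(ord(e_acute))

# The collating sequence differs: EBCDIC sorts lower case, upper case, digits;
# ASCII sorts digits, upper case, lower case.
ebcdic_order = collate(["a", "A", "1"], "cp037")
ascii_order = collate(["a", "A", "1"], "latin_1")

if __name__ == "__main__":
    print("X'4A' decodes as:", same_byte)
    print("é encodes as:", encodings, "code point:", code_point)
    print("EBCDIC order:", ebcdic_order)
    print("ASCII order: ", ascii_order)
```

Sorting by encoded bytes is used here only to mimic a binary collating sequence; it is not a statement about how DB2 itself orders data.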