D16 Code sets, NLS and character conversion vs. DB2

Roland Schock ARS Computer und Consulting GmbH

05.10.2006 • 11:45 a.m. – 12:45 p.m.

Platform: DB2 for Linux, UNIX, Windows

Code sets and character conversion are usually neglected during database design and usage. Everybody expects them to work correctly without any effort. But practice shows that their true impact is often misunderstood, and a few details can help administrators and database developers do the right thing. After some necessary definitions, this presentation describes how you can specify the code page used. You will see what character conversion is and how to avoid common problems. At the end we will briefly discuss performance impacts.

1 Overview

• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

On the next few slides we define basic terms frequently used for this topic. The terms are widely used but often only partially understood. When problems occur, it is essential to understand the concepts in order to deduce the origin of the problem.

2 Character Sets

• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:

A, B, C, ... α γ π ξ ᇹ ぁゆ㌹㌺

亹怔떟떥

Here we use the word 'set' in the mathematical sense: an unordered collection of elements. One of the most widely used character sets in Europe is the Latin alphabet. But this is just a very small subset of the character sets needed for the most common languages. Other, less obvious character sets are naval signal flags, symbols for the sign language of the deaf, Japanese, Chinese or other Asian characters, etc.

3

• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points: A → 17, B → 23, C → 42, …
• Typical examples of encodings include ASCII and EBCDIC.
• Part of the encoding scheme is also the definition of a serialisation scheme to convert the code point into a sequence of bytes.

The symbols of a character set are now put into a sequence and numbered. The ordinal number is then used as the code point for the symbol. If we have more than 256 symbols, a single byte is not enough to encode a character and we have to think about an encoding scheme.
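
The relation between symbol, code point and byte sequence can be made concrete with a small sketch. It assumes Unicode code points and the UTF-8 serialisation (both are choices made for this illustration, not the example numbers from the slide), and the helper name describe is hypothetical.

```python
# Each symbol gets a number (its code point); an encoding scheme turns
# that number into a sequence of bytes.
def describe(symbol: str) -> dict:
    """Return the code point and the UTF-8 byte sequence of one character."""
    return {
        "symbol": symbol,
        "code_point": ord(symbol),
        "utf8_bytes": symbol.encode("utf-8"),
    }


for symbol in ["A", "ä", "€"]:
    info = describe(symbol)
    # !a prints non-ASCII characters as escapes, so the output is terminal-safe
    print(f"{info['symbol']!a}: code point {info['code_point']}, "
          f"{len(info['utf8_bytes'])} byte(s): {info['utf8_bytes'].hex()}")
```

The euro sign has a code point above 255, so it cannot fit into a single byte; this is where the serialisation part of the encoding scheme becomes necessary.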

4 ASCII

• Sample of an encoding scheme:

• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

5 Single Byte Char Sets (SBCS)

• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
• Platform-specific charsets: Windows ANSI or MacRoman

ISO-8859-1 (Latin 1): ASCII + special characters for Western European languages
ISO-8859-2 (Latin 2): ASCII + special characters for Eastern European languages
ISO-8859-3, -4, ..., -14: ASCII + special characters for Arabic, Greek, Turkish, Hebrew, Thai or Baltic languages
ISO-8859-15: modified ISO-8859-1 including the euro symbol (€)
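
That a byte has no meaning without its code page can be shown with one byte value. The sketch below uses Python's built-in iso8859_1 and iso8859_15 codecs; the byte 0xA4 is picked for illustration because it is one of the positions that ISO-8859-15 redefines.

```python
RAW = bytes([0xA4])  # a single byte; its meaning depends on the code page

latin1_char = RAW.decode("iso8859_1")    # ISO-8859-1: currency sign
latin9_char = RAW.decode("iso8859_15")   # ISO-8859-15: euro sign

# !a prints the characters as escapes, so the output is terminal-safe
print(f"0xA4 as ISO-8859-1:  {latin1_char!a}")
print(f"0xA4 as ISO-8859-15: {latin9_char!a}")
```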

6 Double Byte Char Sets (DBCS)

• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size of SBCS

7 EUC (Extended Unix Code)

• Multi Byte Char Set (MBCS): 2 or 4 bytes/char
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code group to build a multi byte character
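
The single shift mechanism can be observed with Python's euc_jp codec, used here as an assumed stand-in for the EUC code pages of the operating system. The two sample characters are chosen for illustration only.

```python
SINGLE_SHIFT_2 = 0x8E  # SS2: announces a character from another code group

hiragana = "あ".encode("euc_jp")            # main Japanese code group
halfwidth_katakana = "ｱ".encode("euc_jp")   # reached via the single shift byte

print(hiragana.hex())             # two bytes, both with the high bit set
print(halfwidth_katakana.hex())   # starts with the single shift byte 8e
print(halfwidth_katakana[0] == SINGLE_SHIFT_2)
```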

8 Unicode

• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition contained 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 extended the character set by 16 planes to up to 1,114,112 characters (32-bit, 1996, UCS-4).
• Today, in Unicode version 4.0, approx. 100,000 characters are assigned to code points.

See also: http://www.unicode.org
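
The upper limit quoted on the slide follows from simple arithmetic: the original 16-bit range counts as one plane, and the extension adds 16 more planes of the same size. A short check, with constant names invented for this sketch:

```python
import sys

PLANE_SIZE = 2 ** 16      # 65,536 code points per plane
ORIGINAL_PLANES = 1       # the original 16-bit range
ADDITIONAL_PLANES = 16    # added by the extension

total_code_points = (ORIGINAL_PLANES + ADDITIONAL_PLANES) * PLANE_SIZE

print(total_code_points)
# Python's own upper bound for code points gives the same number
print(sys.maxunicode + 1)
```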

9 Unicode char sets and encodings

• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4 and UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.

Besides a mapping of characters to numbers, an encoding scheme is essential to store the data as a sequence of bytes. The simplest encoding is to store a 16-bit or 32-bit wide code point in 2 or 4 bytes. This is used in UCS-2 and UCS-4. But this encoding scheme is not very efficient for Latin texts, which mainly consist of ASCII characters: a large part of a text string would consist of 00 bytes. This would also cause problems for the string functions of the C programming language, which use a null byte as the termination character. UTF-8 is an encoding scheme that distributes the bits needed over one or more bytes. This requires a more sophisticated routine to read and write strings, but it allows continued use of the C string functions. Details of the UTF-8 encoding are on the next slide.
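
The byte order problem and the null bytes are easy to see on a short ASCII string. The sketch uses Python's utf-16-be, utf-16-le and utf-8 codecs; the sample text and the musical symbol U+1D11E are arbitrary choices for illustration.

```python
TEXT = "DB2"

utf16_be = TEXT.encode("utf-16-be")   # big-endian: high byte first
utf16_le = TEXT.encode("utf-16-le")   # little-endian: low byte first
utf8 = TEXT.encode("utf-8")           # ASCII characters stay single bytes

print(utf16_be)
print(utf16_le)
print(utf8)
# Number of null bytes in each form: they would end a C string early
print(utf16_be.count(0), utf8.count(0))

# A character beyond the first 64k code points needs two 16-bit words
supplementary = "\U0001D11E".encode("utf-16-be")
print(len(supplementary), supplementary.hex())
```

The same three characters give two different byte sequences in UTF-16 depending on the byte order, while the UTF-8 form is identical to plain ASCII and contains no null byte.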

10 UTF-8

• Encoding in a variable-length sequence of bytes
• Simple recognition of multibyte characters
• Compact storage of text in Latin characters
• Only the shortest encoding is allowed
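
The four properties above can be checked with a few lines. This is a minimal sketch that relies on Python's built-in UTF-8 codec; the helper names utf8_length, byte_role and is_valid_utf8 are made up for this example.

```python
def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 needs for one character."""
    return len(char.encode("utf-8"))


def byte_role(byte: int) -> str:
    """Classify a byte by its leading bits."""
    if byte & 0x80 == 0x00:
        return "single"          # 0xxxxxxx: plain ASCII
    if byte & 0xC0 == 0x80:
        return "continuation"    # 10xxxxxx: follows a lead byte
    return "lead"                # 11xxxxxx: starts a multibyte character


def is_valid_utf8(data: bytes) -> bool:
    """True if the decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print([utf8_length(c) for c in "Aä€\U0001D11E"])
print([byte_role(b) for b in "€".encode("utf-8")])
# 0xC0 0xAF would be a two-byte form of "/", which is not the shortest one
print(is_valid_utf8(b"/"), is_valid_utf8(bytes([0xC0, 0xAF])))
```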


12 Usage of a code page

Code pages can be specified at different levels:
• At the client where the application runs
• At the operating system where the server runs
• At the operating system where the application is prepared/bound
• At the database level

In a client/server environment, the code page used on a client need not be the same as the code page used on the server. Local applications tend to default to the locally defined code page of the operating system.

A special situation can occur in a multiplatform environment, where clients, the server and the application developers generating code with static SQL use different code pages on their machines. Compiling programs with embedded static SQL involves a precompile pass, which needs a database connection. By default the local code page is used, which can differ from the code pages of other users. If a user later accesses the static SQL, a code page conversion can take place to convert the data first to the code page used for the static SQL.

When creating a database, the administrator can specify the code page of the database. This cannot be changed afterwards.

13 Default code page

By default, DB2 server and clients use the local settings of the operating system or user:
• Windows: The server process uses the default regional settings of the operating system.
• Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
• Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
• Programming language: Java always uses Unicode when connecting to a database via JDBC.
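
To get a feeling for what "derived from the locale" means, the following sketch asks Python for the encoding implied by the current user's locale. This only shows Python's view of the locale and is not the lookup DB2 performs internally; the function name local_code_set is hypothetical.

```python
import codecs
import locale


def local_code_set() -> str:
    """Return the codec name that the current user's locale implies."""
    preferred = locale.getpreferredencoding(False)
    # codecs.lookup normalises aliases such as "UTF8" or "cp1252"
    return codecs.lookup(preferred).name


print(local_code_set())
```

Running this under two different user accounts or with different LANG/LC_ALL settings typically gives different answers, which is the same effect that leads to different code pages at CONNECT time.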

14 Specifying a code page: OS level

• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command

15 At prepare/bind time

• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later, the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection; at connect time, the current locale setting is used.

16 Defining a database w/ code page

• Explicitly set the code page at creation time: CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database code set.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).


18 Code page conversion

• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done on the receiver's side:
  - on the server side for data sent from client to server
  - on the client side for data sent from server to client
• Exception: importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332

In some rare cases, code page conversion is done more than once. If you import IXF files on a client machine, a local code page conversion takes place if the IXF files were generated on another machine with a different code page (e.g. data exported to IXF on a Windows machine and imported on a Linux machine). When this data is then sent to a server database with yet another code page, it has to be converted a second time, at the server, from the client's code page to the server's code page.

19 Client to server conversion

Client uses code page X, server uses code page Y.

• Client: sends data using code page X
• Server: receives the data, converts it to code page Y, processes it and returns the result in code page Y
• Client: receives the data in Y and converts it to code page X
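
The client/server flow can be simulated in a few lines. This is only a sketch of the receiver-side principle: the code pages cp1252 and utf-8 stand in for X and Y as assumptions, the function receive is hypothetical, and real DB2 conversion uses its own conversion tables rather than Python codecs.

```python
def receive(data: bytes, sender_code_page: str, receiver_code_page: str) -> bytes:
    """Receiver-side conversion: decode from the sender's code page,
    then encode into the receiver's own code page."""
    return data.decode(sender_code_page).encode(receiver_code_page)


CLIENT_CODE_PAGE = "cp1252"  # assumed value for code page X
SERVER_CODE_PAGE = "utf-8"   # assumed value for code page Y

request = "Müller".encode(CLIENT_CODE_PAGE)                    # client sends in X
stored = receive(request, CLIENT_CODE_PAGE, SERVER_CODE_PAGE)  # server converts X to Y
reply = stored                                                 # server returns result in Y
shown = receive(reply, SERVER_CODE_PAGE, CLIENT_CODE_PAGE)     # client converts Y to X

# Byte lengths before and after conversion, and whether the round trip is clean
print(len(request), len(stored), shown == request)
```

The same six characters occupy six bytes on the client and seven bytes on the server, so the conversion already changes the storage size in this small case.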

20 Using DB2 Connect

Client uses code page X, gateway uses code page Y, server uses code page Z.

• Client: sends data using code page X
• Gateway: receives the data, converts it to code page Y and sends it in Y
• Server: receives the data, converts it to code page Z and returns the result in code page Z
• Gateway: receives the data in Z, converts it to Y and returns the result in Y
• Client: receives the data in Y and converts it to code page X


22 Other considerations

• Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round-trip conversion (bijective): Even if substitution needs to take place between source and target code pages, a round-trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.

Details on character substitution can be found in "Application Development Guide: Programming Client Applications", Chapter 29.
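
What plain substitution does to the data can be tried out with Python's codecs. Note the assumption: Python replaces unmappable characters with a simple "?" and does not use round-trip conversion tables, so this sketch shows the lossy case and the change in byte length only. The function name convert and the sample string are made up.

```python
def convert(text: str, target_code_page: str) -> bytes:
    """Convert text, replacing characters the target code page lacks with '?'."""
    return text.encode(target_code_page, errors="replace")


original = "10 €"
in_latin1 = convert(original, "iso8859_1")   # ISO-8859-1 has no euro sign
round_trip = in_latin1.decode("iso8859_1")

print(in_latin1)                 # the euro sign became a substitution character
print(round_trip == original)    # the original cannot be recovered
# The same text needs a different number of bytes per code page
print(len(convert(original, "cp1252")), len(convert(original, "utf-8")))
```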

After a successful connect to the database, the user/application gets some information returned in the SQLCA data area:
- The second token in the SQLERRMC field (tokens are separated by X'FF') indicates the code page of the database. The ninth token indicates the code page of the application. If they differ, code page conversion will take place.
- The first and second entries in the SQLERRD array: SQLERRD(1) contains an integer value equal to the maximum expected expansion or contraction factor for the length of mixed character data when converted from the application's code page to the database code page. SQLERRD(2) contains this value for conversions from the database code page to the application code page. A value of 0 or 1 indicates no expansion. A value greater than 1 indicates a possible expansion in length; a negative value indicates a possible contraction.
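
Picking the two code pages out of SQLERRMC amounts to splitting on the X'FF' separator. The sketch below works on a made-up byte string: the token values, the assumption that the tokens are plain ASCII text, and the function name are all hypothetical and only illustrate the token positions described above.

```python
def code_pages_from_sqlerrmc(sqlerrmc: bytes) -> tuple:
    """Return (database code page, application code page) from SQLERRMC."""
    tokens = sqlerrmc.split(b"\xff")              # tokens are separated by X'FF'
    database_cp = tokens[1].decode("ascii").strip()     # second token
    application_cp = tokens[8].decode("ascii").strip()  # ninth token
    return database_cp, application_cp


# Hypothetical content, invented for this example only
sample = b"\xff".join(
    [b"t1", b"1208", b"t3", b"t4", b"t5", b"t6", b"t7", b"t8", b"1252", b"t10"]
)

database_cp, application_cp = code_pages_from_sqlerrmc(sample)
conversion_expected = database_cp != application_cp
print(database_cp, application_cp, conversion_expected)
```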

23 More considerations

• Using different conversion tables and the € symbol: The Microsoft ANSI code page and the official standard have different code points for the euro symbol. If needed, code page conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
• For pureXML (V9.x) a UTF-8 database is needed.

When a Unicode database is created, CHAR, VARCHAR, LONG VARCHAR and CLOB data are stored in UTF-8 form, and GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC and DBCLOB data are stored in UCS-2 big-endian form.
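
How a differing euro code point shows up in practice can be illustrated with two code pages that both contain the euro sign. The pair cp1252 and ISO-8859-15 is an assumption chosen for this sketch; it is not necessarily the pair of conversion tables the slide refers to.

```python
EURO = "€"

windows_byte = EURO.encode("cp1252")       # Windows code page 1252
iso_byte = EURO.encode("iso8859_15")       # ISO-8859-15

print(windows_byte.hex(), iso_byte.hex())  # two different code points

# Reading the Windows byte with the ISO table yields a control character,
# reading the ISO byte with the Windows table yields the currency sign
misread_as_iso = windows_byte.decode("iso8859_15")
misread_as_windows = iso_byte.decode("cp1252")
print(f"{misread_as_iso!a} {misread_as_windows!a}")
```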

24 More considerations

• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. Choosing the right database code page during database creation is therefore crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

25 Troubleshooting

Identify the code pages used:
• db2 get db cfg for sample
  Retrieves the database code page.
• Displaying the SQLCA area during CONNECT with CLP
  When connecting to a database via CLP, the option "-a" displays the SQLCA data area, which shows the code pages of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.



27 Performance considerations

• Try to avoid unnecessary conversions.
• Create databases with the code page needed for your applications from the start.
• For international databases prefer UTF-8, especially when used with Java programs.
• Conversion takes time.

Every unnecessary conversion costs some time during data access. The cost is minimal, but it can add up. If you use Java to access data, prefer UTF-8 databases, as Java internally uses Unicode for character encoding. Keep in mind that a CHAR(10) field can contain 10 ASCII characters, i.e. 10 bytes, but not necessarily 10 Unicode characters. So for international applications, prefer VARCHAR fields with some extra bytes left for expansion due to conversion.
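
The CHAR(10) remark can be checked before data is sent to the database. A minimal sketch, assuming a UTF-8 database where the column length counts bytes; the helper names and the sample city names are invented for this example.

```python
def utf8_byte_length(value: str) -> int:
    """Bytes the value occupies once stored in UTF-8."""
    return len(value.encode("utf-8"))


def fits_in_column(value: str, column_bytes: int) -> bool:
    """True if the UTF-8 form of the value fits into a column of that many bytes."""
    return utf8_byte_length(value) <= column_bytes


for city in ["Dusseldorf", "Düsseldorf"]:
    # !a prints non-ASCII characters as escapes, so the output is terminal-safe
    print(f"{city!a}: {len(city)} characters, "
          f"{utf8_byte_length(city)} bytes, fits CHAR(10): {fits_in_column(city, 10)}")
```

Both names have ten characters, but the umlaut takes two bytes in UTF-8, so the second one no longer fits into ten bytes.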

28 Links

• IBM developerWorks white paper: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
• DB2 Infocenter: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp
• Unicode: http://www.unicode.org
• UTF-8 article at Wikipedia: http://en.wikipedia.org/wiki/UTF-8
