
#IDUG

Code sets, NLS and character conversion vs. DB2

Roland Schock ARS Computer und Consulting GmbH

Session: C04, 2014-09-10 | Platform: LUW


Overview

• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations


Character Sets

• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval flag signs or other symbols:

A, B, C, ... ᇹ ぁ ゆ ㌹ ㌺ α γ π ξ 亹 怔 떟 떥


Character Encoding

• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points: A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII and EBCDIC.
• Part of the encoding scheme is also the definition of a serialisation scheme that converts the code points into a sequence of bytes.
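The two steps described above (symbol to code point, code point to bytes) can be sketched in Python with the built-in `ord` and `str.encode`. The sample character and the list of code pages are assumptions chosen for illustration; they do not come from the slides.

```python
# Assumption: "ä" is just an example symbol; any character would do.
symbol = "ä"

# Step 1: the character set assigns the symbol a code point (Unicode here).
code_point = ord(symbol)

# Step 2: a serialisation scheme turns the symbol into a byte sequence.
# Different code pages yield different bytes for the same symbol.
serialised = {
    codec: symbol.encode(codec)
    for codec in ("latin-1", "cp850", "utf-8", "utf-16-be")
}

print(f"code point: U+{code_point:04X}")
for codec, data in serialised.items():
    print(f"{codec:10} -> {data.hex(' ')}")
```

The same symbol ends up as one byte in the two single-byte code pages (with different values) and as two bytes in both Unicode encodings.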


ASCII

• Sample of an encoding scheme
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers
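A small Python sketch of what "ordered mapping to 7-bit numbers" means in practice. The sample strings are assumptions for illustration only.

```python
# Every ASCII character maps to a number in the range 0..127 (7 bits).
text = "DB2 LUW"
codes = [ord(ch) for ch in text]
all_seven_bit = all(code < 128 for code in codes)

# The mapping is ordered: letters and digits appear in ascending sequence,
# so the numeric value of a digit can be computed from its code.
letters_ordered = ord("A") < ord("B") < ord("Z")
digit_value = ord("7") - ord("0")

# Characters outside the 7-bit range cannot be encoded in ASCII.
try:
    "Größe".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(codes, all_seven_bit, letters_ordered, digit_value, encodable)
```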


Single Byte Char Sets (SBCS)

• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
  • ISO-8859-1 (Latin 1): ASCII + Western European characters
  • ISO-8859-2 (Latin 2): ASCII + Eastern European characters
  • ISO-8859-15: modified ISO-8859-1 including the Euro symbol (€)
• Platform-specific charsets: Windows ANSI or MacRoman
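To illustrate why the choice of 8-bit code page matters, the following Python sketch decodes the same three bytes with three ISO-8859 variants. The byte values are assumptions picked because they differ between the variants.

```python
# The same raw bytes: 0x41 is in the shared ASCII range,
# 0xA3 and 0xA4 are in the upper half that each code page defines itself.
raw = bytes([0x41, 0xA3, 0xA4])

decoded = {
    codec: raw.decode(codec)
    for codec in ("iso8859_1", "iso8859_2", "iso8859_15")
}

for codec, text in decoded.items():
    print(f"{codec:11} -> {text}")
```

The ASCII byte is read identically everywhere, while the upper-half bytes turn into different symbols depending on the code page.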


Double Byte Char Sets (DBCS)

• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text takes twice the space it needs in an SBCS


EUC (Extended Unix Code)

• Multi Byte Char Set (MBCS): 1 to 4 bytes/char
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code group to build a multi-byte character
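The single shift mechanism can be observed with Python's `euc_jp` codec, used here only as a stand-in for an EUC code page (DB2 is not involved). The three sample characters are assumptions chosen so that each falls into a different code group.

```python
SS2 = 0x8E  # single shift 2: the next byte belongs to code set 2

ascii_char = "A".encode("euc_jp")      # code set 0: one byte
hiragana = "\u3042".encode("euc_jp")   # code set 1 (hiragana a): two bytes
katakana = "\uff71".encode("euc_jp")   # code set 2 (half-width katakana a)

for label, data in (
    ("A", ascii_char),
    ("hiragana a", hiragana),
    ("half-width katakana a", katakana),
):
    print(f"{label:22} {data.hex(' ')}")

# The half-width katakana is introduced by the single shift byte.
uses_single_shift = katakana[0] == SS2
```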


Unicode

• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition allowed for 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 extended the character set by 16 additional planes for up to 1,114,112 characters (32-bit, 1996, UCS-4).
• As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.
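The figure of 1,114,112 follows from the original plane plus 16 additional planes of 65,536 code points each. A short Python check, assuming a current Python 3 interpreter (where `sys.maxunicode` is the highest code point):

```python
import sys

PLANES = 1 + 16                  # the original plane plus 16 additional planes
CODE_POINTS_PER_PLANE = 2 ** 16  # 65,536 code points per plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
highest_code_point = sys.maxunicode

# chr() accepts the highest code point but rejects anything beyond it.
try:
    chr(highest_code_point + 1)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

print(total_code_points, hex(highest_code_point), beyond_range_ok)
```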


Unicode char sets and encodings

• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4, UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.
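The size and byte order differences listed above can be made visible with a few lines of Python. The three sample characters are assumptions: one ASCII letter, one character from the first 64k code points, and one from beyond them. Python's `utf-32` codec stands in for UCS-4 here.

```python
samples = {
    "latin A": "A",
    "euro sign": "\u20ac",
    "musical clef": "\U0001d11e",  # outside the first 64k code points
}

# Number of bytes each character needs per encoding.
sizes = {
    name: {
        codec: len(ch.encode(codec))
        for codec in ("utf-8", "utf-16-be", "utf-32-be")
    }
    for name, ch in samples.items()
}

for name, per_codec in sizes.items():
    print(f"{name:13} {per_codec}")

# Byte order: the same character serialised big-endian and little-endian.
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")
print(big_endian.hex(" "), "vs.", little_endian.hex(" "))
```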


UTF-8

• Encoding in a variable-length sequence of bytes
• Simple recognition of multi-byte chars
• Compact storage of text in Latin chars
• Only the shortest encoding is allowed
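A minimal Python sketch of two of the properties above: the role of a byte can be recognised from its leading bits, and a longer-than-necessary ("overlong") encoding is rejected. The helper `classify` and the sample data are hypothetical names introduced for this illustration.

```python
def classify(byte: int) -> str:
    """Tell the role of a byte in a UTF-8 stream from its leading bits."""
    if byte < 0x80:      # 0xxxxxxx: a complete one-byte character
        return "single"
    if byte < 0xC0:      # 10xxxxxx: continuation of a multi-byte character
        return "continuation"
    return "lead"        # 11xxxxxx: first byte of a multi-byte character


encoded = "a\u00e4\u20ac".encode("utf-8")   # 'a', 'ä', '€'
roles = [classify(b) for b in encoded]
print(encoded.hex(" "), roles)

# 0xC0 0xAF would be a two-byte (overlong) form of "/", which has a
# one-byte encoding. A conforming decoder must reject it.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
```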



Usage of a code page

• Code pages can be specified at different levels:
  • At the operating system where the application runs
  • At the operating system where the database server runs
  • At the operating system where the application is prepared/bound
  • At the database level


Default code page

• By default, DB2 server and clients use the local settings of the operating system or user:
  • Windows: The server process uses the default regional settings of the operating system.
  • Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
  • Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
  • Programming language: Java always uses Unicode when connecting to a database via JDBC.


Specifying a code page: OS level

• Windows: Regional and Language settings, chcp command
• Linux/Unix: locale command
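Besides the OS commands named above, the encoding derived from the current user's locale can also be inspected from a program. The sketch below uses Python's standard `locale` and `codecs` modules; it only reports what the Python runtime sees and is not a DB2 interface.

```python
import codecs
import locale

# Encoding that the runtime derives from the current user's locale settings.
encoding = locale.getpreferredencoding(False)

# Normalise the name (e.g. "UTF-8" and "utf8" refer to the same codec).
codec_info = codecs.lookup(encoding)

print(f"locale encoding: {encoding} (codec name: {codec_info.name})")
```

Running it in two differently configured sessions is a quick way to see that the result depends on the user environment and not on the program.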


At prepare/bind time

• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and at connect time the current locale setting is used.


Defining a database w/ code page

• Explicitly set the code page at creation time:
  CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).



Code page conversion

• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done on the receiver's side:
  • on the server side for data sent from client to server
  • on the client side for data sent from server to client
• Exception: importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332
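The receiver-side rule can be pictured with a small simulation in Python. This is not DB2 code: the two code pages, the sample value and the helper `convert_at_receiver` are assumptions made up for this sketch.

```python
CLIENT_CODE_PAGE = "cp1252"    # assumed client locale
DATABASE_CODE_PAGE = "utf-8"   # assumed database code page


def convert_at_receiver(data: bytes, sender_cp: str, receiver_cp: str) -> bytes:
    """The receiver converts incoming bytes into its own code page."""
    return data.decode(sender_cp).encode(receiver_cp)


# Client -> server: the client sends its bytes unchanged, the server converts.
sent_by_client = "Müller".encode(CLIENT_CODE_PAGE)
stored_on_server = convert_at_receiver(
    sent_by_client, CLIENT_CODE_PAGE, DATABASE_CODE_PAGE
)

# Server -> client: the server sends its bytes unchanged, the client converts.
received_by_client = convert_at_receiver(
    stored_on_server, DATABASE_CODE_PAGE, CLIENT_CODE_PAGE
)

print(sent_by_client.hex(" "))
print(stored_on_server.hex(" "))
print(received_by_client.decode(CLIENT_CODE_PAGE))
```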

Client to server conversion


Using DB2 Connect


Other considerations

• Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective): If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.
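The three effects above can be reproduced with Python's codecs as a stand-in for conversion tables. The sample word and the code pages are assumptions; `errors="replace"` makes Python insert `?` as the substitution character, which may differ from the substitution character a DB2 conversion table uses.

```python
original = "\u0141\u00f3d\u017a"   # "Łódź"

# Substitution: cp1252 has no "Ł" or "ź", so both are replaced.
substituted = original.encode("cp1252", errors="replace")
after_lossy_round_trip = substituted.decode("cp1252")

# Round trip without substitution: cp1250 contains all four characters.
after_clean_round_trip = original.encode("cp1250").decode("cp1250")

# The number of bytes depends on the target code page.
bytes_in_cp1250 = len(original.encode("cp1250"))
bytes_in_utf8 = len(original.encode("utf-8"))

print(after_lossy_round_trip, after_clean_round_trip)
print(bytes_in_cp1250, bytes_in_utf8)
```

A column sized for the single-byte representation would be too short for the UTF-8 form of the same word.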


More considerations

• Using different conversion tables and the € symbol: The ANSI code page and the official code page definition differ in how they encode the Euro symbol. If needed, code conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode.
• For pureXML (V9.x) a UTF-8 database is needed.
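How much the position of the Euro symbol varies between code pages can be checked with Python's codecs. The selection of code pages is an assumption for illustration; it does not claim to be the exact pair of tables the slide refers to.

```python
EURO = "\u20ac"

# Code pages that contain the Euro symbol place it at different positions.
euro_bytes = {
    codec: EURO.encode(codec)
    for codec in ("cp1252", "iso8859_15", "cp858")
}

# Older code pages do not contain it at all.
missing_in = []
for codec in ("iso8859_1", "cp850"):
    try:
        EURO.encode(codec)
    except UnicodeEncodeError:
        missing_in.append(codec)

# Reading the ISO-8859-15 Euro byte with the Windows table shows another symbol.
misread = euro_bytes["iso8859_15"].decode("cp1252")

print({codec: data.hex() for codec, data in euro_bytes.items()})
print(missing_in, misread)
```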


More considerations

• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. Choosing the right database code page during database creation is therefore crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.



Troubleshooting

• Identify the code pages in use:
  • db2 get db cfg for sample
    Retrieves the database code page.
  • Displaying the SQLCA during CONNECT with CLP
    When connecting to a database via CLP, the option "-a" displays the SQLCA data area, which shows the code page of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.


Pitfalls

• Watch out for unintentional "conversions"
• All database communication partners are configured correctly, but the DBA is looking at the data via a console window, and the console window (or PuTTY) is using the wrong code page to display the data!
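The pitfall is easy to reproduce without any database. In this Python sketch the stored bytes are correct UTF-8; only the (assumed) display code page of the terminal is wrong.

```python
# Correctly stored UTF-8 data, as the database would deliver it.
stored = "Größe".encode("utf-8")

# A terminal that is set to UTF-8 shows the data correctly.
shown_correctly = stored.decode("utf-8")

# A terminal that interprets the same bytes as cp1252 shows garbage,
# although nothing is wrong with the data itself.
shown_wrongly = stored.decode("cp1252")

print(shown_correctly)
print(shown_wrongly)
```

Comparing the raw bytes (for example as hex) instead of the displayed text helps to tell a display problem from damaged data.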

db2set DB2CODEPAGE

• Know what you intend to do if you use the DB2 registry variable DB2CODEPAGE.
• It tells DB2 you will feed it with the right code points regardless of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion": SQL0191N Error occurred because of a fragmented MBCS character. http://www.ibm.com/support/docview.wss?uid=swg21601028
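Why a wrong promise about the code page leads to broken multi-byte characters can be pictured with plain Python; this is an analogy under stated assumptions, not a reproduction of the technote. The assumption is a client that really sends cp1252 bytes while the data is declared to be UTF-8 (code page 1208).

```python
# What the application really sends: cp1252 bytes.
actual_bytes = "Müller".encode("cp1252")

# What was promised: UTF-8. The byte 0xFC announces a multi-byte sequence
# that never follows, so the data is not valid UTF-8.
try:
    actual_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError as exc:
    valid_as_utf8 = False
    problem = f"byte 0x{actual_bytes[exc.start]:02x} at position {exc.start}"
    print("not valid UTF-8:", problem)
```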

db2set DB2CONSOLECP

• Intended to allow DB2 CLI to use different code pages for output
• Multiple APARs for DB2 9.1, 9.5, 9.7: "DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"


DB2 Special Registers for NLS

• Change message text for DB2 MONREPORT modules:
  db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
  db2 "call monreport.lockwait"
• Change names for times/dates:
  db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
  db2 "values monthname(current date)"
  (Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP, TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)


Performance considerations

• Try to avoid unnecessary conversions.
• Create databases with the code page your applications need right from the start.
• For international databases prefer UTF-8, especially when used with Java programs.
• Remember: Conversion takes time.


Links

• IBM developerWorks white paper: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.
• DB2 Infocenter: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp
• Unicode: http://www.unicode.org
• UTF-8 article at Wikipedia: http://en.wikipedia.org/wiki/UTF-8

Roland Schock ARS Computer und Consulting GmbH

C04 Code sets, NLS and character conversion vs. DB2