#IDUG
Code sets, NLS and character conversion vs. DB2
Roland Schock ARS Computer und Consulting GmbH
Session Code: C04 | 2014-09-10 | Platform: LUW

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Character Sets
• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:
A, B, C, ... ᇹ ぁ ゆ ㌹ ㌺ α γ π ξ 亹 怔 떟 떥

Character Encoding
• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points, for example A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• Part of the encoding scheme is also the definition of a serialisation scheme to convert the code point into a sequence of bytes.
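The two steps named above (character to code point, code point to bytes) can be sketched in a few lines of Python. This is only an illustration using Python's built-in codec tables; `cp037` is picked here as one example of an EBCDIC code page, it is not taken from the slides.

```python
# Step 1: character -> code point. Step 2: code point -> byte sequence.
ch = "A"
code_point = ord(ch)  # Unicode code point of "A"

# The same character serialised under different encodings.
serialised = {
    "ascii": ch.encode("ascii"),          # 7-bit ASCII
    "cp037": ch.encode("cp037"),          # an EBCDIC code page (example choice)
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),  # two bytes, big-endian
}

for name, raw in serialised.items():
    print(f"{name:10} {raw.hex()}")
```

The character is the same in every row; only the bit pattern assigned to it differs.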

ASCII
• Sample of an encoding scheme
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)
• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
  • ISO-8859-1 (Latin 1): ASCII + Western European characters
  • ISO-8859-2 (Latin 2): ASCII + Eastern European characters
  • ISO-8859-15: modified ISO-8859-1 including the Euro symbol (€)
• Platform-specific charsets: Windows ANSI or MacRoman
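Because all of these code pages share the same 256 byte values, the same bytes stand for different characters depending on which code page is assumed. A small sketch with Python's built-in ISO-8859 codecs (the two byte values are chosen for illustration only):

```python
# Two bytes from the upper half (0x80-0xFF) of an 8-bit code page.
raw = bytes([0xE6, 0xA4])

# Decode the identical bytes under three ISO-8859 variants.
decoded = {cp: raw.decode(cp) for cp in ("iso-8859-1", "iso-8859-2", "iso-8859-15")}

for cp, text in decoded.items():
    print(cp, text)
```

The lower half (0x00 to 0x7F) is plain ASCII in all three and decodes identically.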

Double Byte Char Sets (DBCS)
• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size of SBCS

EUC (Extended Unix Code)
• Multi Byte Char Set (MBCS): one to four bytes per character, depending on the code set
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single-shift characters to switch to another code group to build a multi-byte character
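As a rough illustration of the single-shift mechanism, the following sketch encodes three characters with Python's `euc_jp` codec. The sample characters are chosen for this example; in EUC-JP the byte 0x8E is the single-shift character (SS2) that introduces a half-width katakana.

```python
# Characters from three different EUC-JP code groups (illustrative choice).
samples = {
    "ASCII letter": "A",
    "Hiragana": "\u3042",             # HIRAGANA LETTER A
    "Half-width katakana": "\uff71",  # HALFWIDTH KATAKANA LETTER A
}

encoded = {label: text.encode("euc_jp") for label, text in samples.items()}

for label, raw in encoded.items():
    print(f"{label:20} {len(raw)} byte(s): {raw.hex()}")

SS2 = 0x8E  # single-shift 2: the next byte belongs to another code group
uses_single_shift = encoded["Half-width katakana"][0] == SS2
```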

Unicode
• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition contained 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 extended the charset with 16 additional planes for up to 1,114,112 characters (32-bit, 1996, UCS-4).
• As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.

Unicode char sets and encodings
• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters with one to four bytes
• Possible problem with UCS-2, UCS-4 and UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.
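The size differences and the byte order issue can be checked with a short Python sketch. The three sample characters are an assumption of this example: one ASCII letter, one character from the first 64k code points, and one beyond them. Python's `utf-32` codec stands in for UCS-4 here.

```python
# One ASCII letter, one BMP character, one character outside the first 64k.
chars = ["A", "\u20ac", "\U0001F600"]
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# Number of bytes each character needs in each encoding.
sizes = {ch: {enc: len(ch.encode(enc)) for enc in encodings} for ch in chars}

for ch, per_encoding in sizes.items():
    print(f"U+{ord(ch):04X}", per_encoding)

# Byte order: the same 16-bit word, serialised in two different orders.
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")
```

UTF-8 has no byte order issue because it is defined as a sequence of single bytes.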

UTF-8
• Encoding in variable length sequence of bytes
• Simple recognition of multibyte chars
• Compact storage of text in Latin chars
• Only the shortest encoding allowed
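Two of these properties can be made concrete with a small sketch: the role of a byte is visible from its leading bits, and a longer-than-necessary ("overlong") encoding is rejected. The helper names `classify` and `is_valid_utf8` are hypothetical, and `classify` is deliberately simplified (it does not reject byte values that never occur in valid UTF-8).

```python
def classify(byte: int) -> str:
    """Tell the role of a byte in a UTF-8 stream from its leading bits."""
    if byte < 0x80:
        return "single"        # 0xxxxxxx: one-byte character (ASCII)
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: follows a lead byte
    if byte < 0xE0:
        return "lead-2"        # 110xxxxx: starts a two-byte sequence
    if byte < 0xF0:
        return "lead-3"        # 1110xxxx: starts a three-byte sequence
    return "lead-4"            # 11110xxx: starts a four-byte sequence


def is_valid_utf8(raw: bytes) -> bool:
    """Let Python's strict UTF-8 decoder decide whether raw is well-formed."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


kinds = [classify(b) for b in "a\u20ac".encode("utf-8")]  # "a" plus Euro sign
print(kinds)

# "/" is 0x2F; the two-byte form C0 AF would carry the same value but is overlong.
print(is_valid_utf8(b"/"), is_valid_utf8(b"\xc0\xaf"))
```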

Usage of a code page
• Code pages can be specified at different levels:
  • At the operating system where the application runs
  • At the operating system where the server runs
  • At the operating system where the application is prepared/bound
  • At the database level

Default code page
• By default, DB2 servers and clients use the local settings of the operating system or user:
  • Windows: The server process uses the default region settings of the operating system.
  • Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
  • Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
  • Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level
• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command
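To see what an application process actually inherits from these OS settings, one can ask the runtime. The sketch below uses Python's standard `locale` module; it shows what the Python process sees and is only an assumption about what a DB2 client started in the same session would pick up.

```python
import locale
import sys

# Encoding derived from the user's locale / regional settings.
preferred = locale.getpreferredencoding(False)

# Encoding the console (or redirected output) is assumed to use.
console = sys.stdout.encoding

print("locale encoding: ", preferred)
print("console encoding:", console)
```

If the two values differ, displayed characters can look wrong even though the data itself is intact.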

At prepare/bind time
• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and at connect time the current locale setting is used.

Defining a database w/ code page
• Explicitly set the code page at creation time:
  CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Code page conversion
• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done at the receiver's side:
  • at the server's side for data sent from client to server
  • at the client's side for data sent from server to client
• Exception: Importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332
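The receiver-side rule can be modelled in a few lines. This is a conceptual sketch only, not DB2 code: `convert` is a hypothetical helper, and the pairing of a Windows-1252 client with a UTF-8 database is an assumption of this example.

```python
def convert(raw: bytes, source_cp: str, target_cp: str) -> bytes:
    """Re-encode raw bytes from the sender's code page to the receiver's."""
    return raw.decode(source_cp).encode(target_cp)


client_cp, server_cp = "cp1252", "utf-8"  # assumed code pages

# Client -> server: the client sends bytes in its own code page,
# the server (the receiver) converts them to the database code page.
sent = "M\u00fcller \u20ac".encode(client_cp)
stored = convert(sent, client_cp, server_cp)

# Server -> client: the server sends database bytes unchanged,
# the client (now the receiver) converts them to its own code page.
returned = convert(stored, server_cp, client_cp)

print(sent.hex(), stored.hex(), returned.hex())
```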

Client to server conversion [figure not reproduced]

Using DB2 Connect [figure not reproduced]

Other considerations
• Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective): If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.
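All three effects can be reproduced with Python's codecs as a stand-in for real conversion tables. The sample strings are chosen for this sketch, and Python substitutes `?` where a conversion table may define a different substitution character.

```python
# Substitution: the Euro sign does not exist in ISO-8859-1.
substituted = "Gr\u00fc\u00dfe \u20ac".encode("iso-8859-1", errors="replace")

# Round trip without substitution: ISO-8859-1 -> UTF-8 -> ISO-8859-1.
original = "Gr\u00fc\u00dfe".encode("iso-8859-1")
as_utf8 = original.decode("iso-8859-1").encode("utf-8")
round_trip = as_utf8.decode("utf-8").encode("iso-8859-1")

# Byte count: the same five characters need more bytes in UTF-8.
print(substituted)
print(len(original), "bytes in ISO-8859-1,", len(as_utf8), "bytes in UTF-8")
print("round trip lossless:", round_trip == original)
```

The byte growth matters when sizing CHAR and VARCHAR columns, since a value that fits in the source code page may no longer fit after conversion.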

More considerations
• Using different conversion tables and the € symbol: The Microsoft ANSI code page and the official code page 850 have different code points for the Euro symbol. If needed, code conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
• For pureXML (V9.x) a UTF-8 database is needed.
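How much the Euro code point varies between code pages can be seen with Python's own codec tables. These are not DB2's conversion tables and may differ from them: in Python, `cp850` has no Euro sign at all and the Euro-enabled variant is called `cp858`.

```python
euro = "\u20ac"

# Code point of the Euro sign in several single-byte code pages.
euro_bytes = {cp: euro.encode(cp).hex() for cp in ("cp1252", "cp858", "iso-8859-15")}

# Python's cp850 table predates the Euro and cannot encode it.
try:
    euro.encode("cp850")
    cp850_has_euro = True
except UnicodeEncodeError:
    cp850_has_euro = False

print(euro_bytes, "cp850 has Euro:", cp850_has_euro)
```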

More considerations
• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

Troubleshooting
• Identify the code pages in use:
  • db2 get db cfg for sample
    Retrieves the database code page.
  • Display the SQLCA during CONNECT with the CLP
    When connecting to a database via CLP, the option "-a" displays the SQLCA data area, which shows the code page of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.

Pitfalls
• Watch out for unintentional "conversions"
• All database communication partners are configured correctly, but the DBA is looking at the data in a console window, and the console window (or PuTTY) is using a font with the wrong code page to display the data!
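The console pitfall is easy to reproduce: correct UTF-8 bytes shown by a window that assumes Windows-1252. The code pages and the sample word are assumptions of this sketch.

```python
# Bytes as they are correctly stored in a UTF-8 database.
stored = "Gr\u00fc\u00dfe".encode("utf-8")

# What a console assuming Windows-1252 shows for exactly the same bytes.
shown = stored.decode("cp1252")

print("stored bytes: ", stored.hex())
print("console shows:", shown)
print("data intact:  ", stored.decode("utf-8") == "Gr\u00fc\u00dfe")
```

The data is not damaged; only the display is wrong. Checking the raw bytes (for example with the HEX function in SQL) separates the two cases.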

db2set DB2CODEPAGE
• Know what you intend to do if you use the DB2 environment variable DB2CODEPAGE.
• It tells DB2 that you will feed it the right code points, regardless of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion" (SQL0191N Error occurred because of a fragmented MBCS character): http://www.ibm.com/support/docview.wss?uid=swg21601028
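The following sketch is an analogy for what goes wrong when bytes are declared as code page 1208 (UTF-8) but were actually produced in a single-byte code page. It uses Python's decoder, not DB2, so the error is Python's and not SQL0191N; the sample word is chosen for this example.

```python
# The application really produces ISO-8859-1 bytes ...
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # last byte is 0xE9

# ... but they are interpreted as UTF-8. In UTF-8, 0xE9 announces a
# three-byte character, yet no continuation bytes follow.
try:
    latin1_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc.reason

print("bytes:", latin1_bytes.hex(), "| decode error:", error)
```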

db2set DB2CONSOLECP
• Intended to allow DB2 CLI to use different code pages for output
• Multiple APARs for DB2 9.1, 9.5, 9.7: "DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"

DB2 Special Registers for NLS
• Change message text for DB2 Monreport modules:
  db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
  db2 "call monreport.lockwait"
• Change the language of month and day names:
  db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
  db2 "values monthname(current date)"
  (Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP, TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)

Performance considerations
• Try to avoid unnecessary conversions.
• Create databases with the code page your applications need from the start.
• For international databases prefer UTF-8, especially when used with Java programs.
• Remember: Conversion takes time.