Code Sets, NLS and Character Conversion Vs. DB2

#IDUG
Code sets, NLS and character conversion vs. DB2
Roland Schock, ARS Computer und Consulting GmbH
Session Code: C04 | 2014-09-10 | Platform: LUW

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Character Sets
• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval flag signs or other symbols: A, B, C, ... ᇹ ぁゆ㌹㌺ α γ π ξ 亹怔 떟 떥

Character Encoding
• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points: A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• Part of the encoding scheme is also the definition of a serialisation scheme to convert the code point into a sequence of bytes.
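To make the mapping from symbol to bit pattern concrete, here is a minimal Python sketch (Python is an assumption, the slides do not prescribe a language). It encodes the same character with an ASCII, an EBCDIC and a Unicode codec from Python's standard library; the helper name `describe` is hypothetical, and `cp037` is only one of several EBCDIC code pages.

```python
def describe(ch, encodings=("ascii", "cp037", "utf-16-be")):
    """Return the byte sequence that each encoding assigns to one character."""
    return {enc: ch.encode(enc) for enc in encodings}


for enc, raw in describe("A").items():
    # Same symbol, different bit patterns depending on the code page.
    print(f"{enc:10s} {raw.hex()}")
```

The same letter ends up as a different byte value in ASCII and in EBCDIC, and as two bytes in UTF-16, which is why both sides of a connection need to agree on (or convert between) code pages.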
ASCII
• Sample of an encoding scheme
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)
• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
• ISO-8859-1 (Latin 1): ASCII + Western European characters
• ISO-8859-2 (Latin 2): ASCII + Eastern European characters
• ISO-8859-15: modified ISO-8859-1 including the Euro symbol (€)
• Platform-specific charsets: Windows ANSI or MacRoman

Double Byte Char Sets (DBCS)
• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size of SBCS

EUC (Extended Unix Code)
• Multi Byte Char Set (MBCS): one to four bytes per character, depending on the variant
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code set to build a multi-byte character

Unicode
• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition contained 65,536 code points (16-bit, 1991, UCS-2).
• Version 2.0 extended the code space by 16 additional planes to 1,114,112 code points (32-bit, 1996, UCS-4).
• As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.

Unicode char sets and encodings
• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4, UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.
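The size and byte order differences listed above can be checked with a short Python sketch. It is only an illustration: Python's `utf-32-be` codec stands in for UCS-4, the sample characters are arbitrary, and the helper name `encoded_lengths` is hypothetical.

```python
# Sample characters: U+0041 (A), U+00E4 (a with diaeresis),
# U+20AC (Euro sign), U+1F600 (an emoji outside the first 64k code points).
SAMPLES = ["\u0041", "\u00e4", "\u20ac", "\U0001F600"]


def encoded_lengths(ch):
    """Number of bytes one character needs in each Unicode encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# Byte order: the same code point serialised big-endian and little-endian.
print("\u20ac".encode("utf-16-be").hex())
print("\u20ac".encode("utf-16-le").hex())
```

UTF-8 grows from one to four bytes across the samples, UTF-16 jumps from two to four bytes once the code point leaves the first 64k, and the two UTF-16 serialisations of the Euro sign differ only in byte order.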
UTF-8
• Encoding in a variable-length sequence of bytes
• Simple recognition of multi-byte characters
• Compact storage of text in Latin characters
• Only the shortest encoding is allowed

Usage of a code page
• Code pages can be specified at different levels:
  • At the operating system where the application runs
  • At the operating system where the server runs
  • At the operating system where the application is prepared/bound
  • At the database level

Default code page
• By default, DB2 servers and clients use the local settings of the operating system or user:
  • Windows: The server process uses the default region settings of the operating system.
  • Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
  • Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
  • Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level
• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command

At prepare/bind time
• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and at connect time the current locale setting is used.
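Because the locale in effect at connect time decides which code page is used, it can help to check what the current environment reports before connecting, preparing or binding. The following Python sketch is only an approximation of that check: it shows the code set the Python runtime derives from the locale, which is not necessarily identical to the code page number the DB2 client picks. The function name `current_code_set` is hypothetical.

```python
import codecs
import locale


def current_code_set():
    """Return the normalised codec name for the locale's preferred encoding."""
    # False: inspect the locale as it is, without changing process settings.
    name = locale.getpreferredencoding(False)
    return codecs.lookup(name).name


print(current_code_set())
```

Running this in the same shell or service account that later runs the application gives a quick hint whether the session locale is what you expect.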
Defining a database w/ code page
• Explicitly set the code page at creation time:
  CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database code set.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Code page conversion
• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done on the receiver's side:
  • on the server's side for data sent from client to server
  • on the client's side for data sent from server to client
• Exception: importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332

Client to server conversion

Using DB2 Connect

Other considerations
• Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective): If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.

More considerations
• Using different conversion tables and the € symbol: The Microsoft ANSI code page and the official code page 850 have a different code point for the Euro symbol. If needed, the conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
• For pureXML (V9.x) a UTF-8 database is needed.

More considerations (continued)
• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

Troubleshooting
• Identify the code pages in use:
  • db2 get db cfg for sample
    Retrieves the database code page.
  • Displaying the SQLCA during CONNECT with CLP
    When connecting to a database via CLP, the option "-a" displays the SQLCA data area, which shows the code page of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.

Pitfalls
• Watch out for unintentional "conversions".
• All database communication partners are configured correctly, but the DBA is looking at the data via a console window, and the console window (or PuTTY) is using a font with the wrong code page to display the data!

db2set DB2CODEPAGE
• Know what you intend to do if you use the DB2 environment variable DB2CODEPAGE.
• It tells DB2 you will feed it with the right code points regardless of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion": SQL0191N Error occurred because of a fragmented MBCS character.
  http://www.ibm.com/support/docview.wss?uid=swg21601028

db2set DB2CONSOLECP
• Intended to allow DB2 CLI to use different code pages for output.
• Multiple APARs for DB2 9.1, 9.5, 9.7: "DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"

DB2 Special Registers for NLS
• Change the message text for DB2 MONREPORT modules:
  db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
  db2 "call monreport.lockwait"
• Change the language of names in dates and times:
  db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
  db2 "values monthname(current date)"
  (Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP, TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)

Performance considerations
• Try to avoid unnecessary conversions.
• Create databases with the code page your applications need from the start.
• For international databases prefer UTF-8, especially when used with Java programs.
• Remember: Conversion takes time.

Links
• IBM developerWorks white paper: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
• DB2 Infocenter: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp
• Unicode: http://www.unicode.org
• UTF-8 article at Wikipedia: http://en.wikipedia.org/wiki/UTF-8

Roland Schock, ARS Computer und Consulting GmbH, [email protected]
C04: Code sets, NLS and character conversion vs. DB2
