Code Sets, NLS and Character Conversion Vs. DB2
#IDUG Code sets, NLS and character conversion vs. DB2
Roland Schock, ARS Computer und Consulting GmbH
Session Code: C04 | 2014-09-10 | Platform: LUW

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Character Sets
• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval flag signals or other symbols:
  A, B, C, ... ᇹ ぁ ゆ ㌹ ㌺ α γ π ξ 亹 怔 떟 떥

Character Encoding
• A character encoding or code page maps the symbols of a character set to bit patterns, which are also referred to as code points, for example: A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC and Unicode.
• The encoding scheme also defines a serialisation scheme that converts a code point into a sequence of bytes.
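
The two steps named on the Character Encoding slide (symbol to code point, code point to bytes) can be made visible with a few lines of Python. This is only an illustrative sketch using Python's built-in codecs, not anything DB2 does internally; the helper name `encode_all` and the choice of the three codecs are assumptions made for this example.

```python
# Step 1: character -> code point (ord).
# Step 2: code point -> byte sequence (str.encode with a given codec).

def encode_all(ch, codecs=("utf-8", "utf-16-be", "cp1252")):
    """Return {codec name: serialised bytes} for a single character.

    encode_all is a hypothetical helper for this sketch.
    """
    return {codec: ch.encode(codec) for codec in codecs}


for ch in ("A", "\u20ac"):  # "A" and the euro sign
    # !a prints an ASCII-safe representation, so the output does not
    # depend on the code page of the console.
    print(f"{ch!a}: code point U+{ord(ch):04X}")
    for codec, raw in encode_all(ch).items():
        print(f"  {codec:<10} {raw.hex(' ')}")
```

The same code point (U+20AC) ends up as three, two or one byte(s) depending on the serialisation chosen, which is why a byte sequence cannot be interpreted without knowing its code page.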

ASCII
• Sample of an encoding scheme:
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)
• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
• ISO-8859-1 (Latin 1): ASCII + Western European characters
• ISO-8859-2 (Latin 2): ASCII + Eastern European characters
• ISO-8859-15: Modified ISO-8859-1 including the euro symbol (€)
• Platform specific charsets: Windows ANSI or MacRoman

Double Byte Char Sets (DBCS)
• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size it has in an SBCS

EUC (Extended Unix Code)
• Multi Byte Char Set (MBCS): one to four bytes per character
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code group to build a multi byte character

Unicode
• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition allowed for 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 extended the code space by 16 further planes to up to 1,114,112 code points (32-bit, 1996, UCS-4).
• As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.

Unicode char sets and encodings
• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: Encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4, UTF-16: Byte order differences (big-endian vs. little-endian) between different processor architectures.
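
The size and byte order differences listed above can be checked with a short Python sketch. It relies only on Python's standard codecs; `utf-32-be` stands in for UCS-4 here (an assumption of this example, since Python ships no codec named UCS-4), and `byte_lengths` is a hypothetical helper name.

```python
import codecs

CODECS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(ch):
    """Return {codec: number of bytes} for a single character."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}


# Escapes keep the source file itself pure ASCII.
samples = {
    "A": "A",                  # U+0041
    "e-acute": "\u00e9",       # U+00E9
    "euro": "\u20ac",          # U+20AC
    "G clef": "\U0001D11E",    # U+1D11E, outside the first 64k code points
}
for name, ch in samples.items():
    print(f"{name:8} U+{ord(ch):06X} {byte_lengths(ch)}")

# Byte order: the same 16-bit word, serialised in both orders.
euro = "\u20ac"
print("big-endian:   ", euro.encode("utf-16-be").hex(" "))
print("little-endian:", euro.encode("utf-16-le").hex(" "))
# A byte order mark at the start of a stream tells the receiver which one applies.
print("BOMs:", codecs.BOM_UTF16_BE.hex(" "), "/", codecs.BOM_UTF16_LE.hex(" "))
```

UTF-8 is a plain byte sequence and therefore has no byte order problem, which is one reason it is convenient for exchanging data between platforms.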

UTF-8
• Encoding in a variable length sequence of bytes
• Simple recognition of multibyte characters
• Compact storage of text in Latin characters
• Only the shortest encoding is allowed

Usage of a code page
• Code pages can be specified at different levels:
  • At the operating system where the application runs
  • At the operating system where the server runs
  • At the operating system where the application is prepared/bound
  • At the database level

Default code page
• By default, DB2 servers and clients use the local settings of the operating system or user:
  • Windows: The server process uses the default region settings of the operating system.
  • Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
  • Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
  • Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level
• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command

At prepare/bind time
• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection; the locale setting in effect at connect time is used.
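
Since a LUW client takes its code page from the locale in effect at CONNECT (and at prepare/bind) time, it can help to check what a process started from the same shell actually sees. The following is a minimal sketch in Python that only reports the operating system locale; it does not query DB2, and `client_code_set` is a hypothetical helper name introduced for this example.

```python
import locale
import sys


def client_code_set():
    """Return the code set name the current process derives from its locale."""
    try:
        # Adopt the user's settings (LANG / LC_ALL on Linux/Unix,
        # regional settings on Windows) instead of the default "C" locale.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # locale name not supported on this system; keep the default
    return locale.getpreferredencoding(False)


print("locale code set:", client_code_set())
# The terminal may use yet another encoding for display.
print("stdout encoding:", getattr(sys.stdout, "encoding", None))
```

Running it once with the normal login environment and once with, for example, a different LANG value shows how easily the effective code page changes between sessions.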

Defining a database w/ code page
• Explicitly set the code page at creation time:
  CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Code page conversion
• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done at the receiver's side:
  • at the server's side for data sent from client to server
  • at the client's side for data sent from server to client
• Exception: Importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332

Client to server conversion (diagram)

Using DB2 Connect (diagram)

Other considerations
• Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective): If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.

More considerations
• Using different conversion tables and the € symbol: The Microsoft ANSI code page and the official code page 850 have different code points for the euro symbol. If needed, code page conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
• For pureXML (V9.x) a UTF-8 database is needed.

More considerations
• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

Troubleshooting
• Identify the code pages in use:
  • db2 get db cfg for sample
    Retrieves the database code page.
  • Displaying the SQLCA during CONNECT with the CLP
    When connecting to a database via the CLP, the option "-a" displays the SQLCA data area, which shows the code pages of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.

Pitfalls
• Watch out for unintentional "conversions"
• All database communication partners are configured correctly, but the DBA is looking at the data via a console window, and the console window (or PuTTY) is using a font with the wrong code page to display the data!

db2set DB2CODEPAGE
• Know what you intend to do if you use the DB2 environment variable DB2CODEPAGE.
• It tells DB2 that you will feed it the right code points regardless of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion"
  SQL0191N Error occurred because of a fragmented MBCS character.
  http://www.ibm.com/support/docview.wss?uid=swg21601028

db2set DB2CONSOLECP
• Intended to allow DB2 CLI to use different code pages for output
• Multiple APARs for DB2 9.1, 9.5, 9.7: "DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"

DB2 Special Registers for NLS
• Change the message text for DB2 Monreport modules:
  db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
  db2 "call monreport.lockwait"
• Change the language of names for times/dates:
  db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
  db2 "values monthname(current date)"
  (Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP, TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)

Performance considerations
• Try to avoid unnecessary conversions.
• Create databases with the code page needed for your applications from the start.
• For international databases prefer UTF-8, especially when used with Java programs.
• Remember: Conversion takes time.

Links
• IBM developerWorks white paper: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
• DB2 Infocenter: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp
• Unicode: http://www.unicode.org
• UTF-8 article at Wikipedia: http://en.wikipedia.org/wiki/UTF-8

Roland Schock, ARS Computer und Consulting GmbH
C04: Code sets, NLS and character conversion vs. DB2