IDUG EU 2006 Roland Schock: Code Sets, NLS, Character Conversion Vs

D16
Code sets, NLS and character conversion vs. DB2
Roland Schock
ARS Computer und Consulting GmbH
05.10.2006 • 11:45 a.m. – 12:45 p.m.
Platform: DB2 for Linux, Unix, Windows

Code sets and character conversion are usually neglected during database design and usage. Everybody expects them to work correctly without any effort. But practice shows that their true impact is often misunderstood, and a few details can help administrators and database developers do the right thing. After some necessary definitions, this presentation describes how you can specify the code page used. You will see what character conversion is and how to avoid common problems. At the end we will briefly discuss performance impacts.

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

On the next few slides we will define basic terms frequently used for this topic. The terms are widely used, but often only partially understood. When problems occur, it is essential to understand the concepts in order to deduce the origin of the problem.

Character Sets
• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval flag signs or other symbols: A, B, C, ... α γ π ξ ᇹ ぁゆ㌹㌺亹怔떟떥

Here we use the word 'set' in the mathematical sense: an unordered collection of elements. One of the most used character sets in Europe is the Latin alphabet. But this is just a very small subset of the character sets needed for the most common languages. Other, less obvious character sets are naval flag signs, symbols for the sign language of the deaf, Japanese, Chinese or other Asian characters, etc.
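The mathematical notion of a set used here can be sketched in a few lines of Python. This is only an illustration: the particular symbols and the variable names (latin_letters, greek_sample, needed_for_german) are assumptions chosen for the example, and nothing in it assigns numbers or bytes to the characters yet.

```python
import string

# Three small character sets; the choice of symbols is illustrative only.
latin_letters = set(string.ascii_uppercase)
digits = set(string.digits)
greek_sample = {"α", "γ", "π", "ξ"}

# A set is unordered and ignores duplicates.
same_set = {"A", "B", "C"} == {"C", "A", "B", "A"}
print(same_set)  # True

# The plain Latin alphabet is only a proper subset of what a single
# language such as German needs (umlauts and sharp s added here).
needed_for_german = latin_letters | {"Ä", "Ö", "Ü", "ß"}
print(latin_letters < needed_for_german)  # True

# Different character sets may have no symbols in common at all.
print(latin_letters.isdisjoint(greek_sample))  # True
```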
Character Encoding
• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points: A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• Part of the encoding scheme is also the definition of a serialisation scheme to convert the code point into a sequence of bytes.

The symbols of a character set are now put in a sequence and numbered. The ordinal number is then used as the code point for this symbol. If we have more than 256 symbols, a single byte isn't enough to encode a character and we have to think about an encoding scheme.

ASCII
• Sample of an encoding scheme:
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)
• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
• Platform-specific charsets: Windows ANSI or MacRoman

ISO-8859-1 (Latin 1): ASCII + special characters for Western European languages
ISO-8859-2 (Latin 2): ASCII + special characters for Eastern European languages
ISO-8859-3, -4, ..., -14: ASCII + special characters for Arabic, Greek, Turkish, Hebrew, Thai or Baltic languages
ISO-8859-15: modified ISO-8859-1 including the euro symbol (€)

Double Byte Char Sets (DBCS)
• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size it has in an SBCS

EUC (Extended Unix Code)
• Multi Byte Char Set (MBCS): 2 or 4 bytes/char
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code group to build a multi-byte character

Unicode
• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition provided room for 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 extended the charset with 16 additional planes for up to 1,114,112 characters (32-bit, 1996, UCS-4).
• Today, in Unicode Version 4.0, approx. 100,000 characters are assigned to code points.

See also: http://www.unicode.org

Unicode char sets and encodings
• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: dynamic or variable-length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4, UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.

Besides a mapping of characters to numbers, an encoding scheme is essential to store the data as a sequence of bytes. The simplest encoding is to store a 16-bit or 32-bit wide code point in 2 or 4 bytes. This is used in UCS-2 and UCS-4. But this encoding scheme is not very efficient for Latin texts, which mainly consist of ASCII characters: a large part of such a text string would consist of 00 bytes. This would also cause problems for the string functions of the C programming language, which use a null byte as the termination character. UTF-8 is an encoding scheme which distributes the bits needed over one or more bytes. This requires a more sophisticated routine to read and write strings, but it allows the C string functions to remain usable. Details of the UTF-8 encoding are on the next slide.

UTF-8
• Encoding in a variable-length sequence of bytes
• Simple recognition of multibyte chars
• Compact storage of text in Latin chars
• Only the shortest encoding is allowed

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Usage of a code page
Code pages can be specified at different levels:
• At the operating system where the application runs
• At the operating system where the server runs
• At the operating system where the application is prepared/bound
• At the database level

In a client/server environment, the code page used on a client need not be the same as the code page used on the server. Local applications tend to use the locally defined code page of the operating system as their default. A special situation can occur in a multiplatform environment, where clients, server and the application developers generating code with static SQL use different code pages on their machines. Compiling programs with embedded static SQL involves a precompile pass, which needs a database connection. By default the local code page is used, which can be different from the other users' code pages. If a user later accesses the static SQL, a code page conversion can happen to convert the data first to the code page used for the static SQL. During creation of a database the administrator can specify the code page of the database. This can't be changed afterwards.

Default code page
• By default, DB2 server and clients use the local settings of the operating system or user:
  • Windows: The server process uses the default regional settings of the operating system.
  • Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
  • Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
  • Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level
• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command

At prepare/bind time
• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and at connect time the current locale setting is used.

Defining a database w/ code page
• Explicitly set the code page at creation time:
  CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database code set.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Code page conversion
• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done at the receiver's side:
  • at the server's side for data sent from client to server
  • at the client's side for data sent from server to client
• Exception: importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332

In some rare cases a code page conversion is done more than once. If you import IXF files on a client machine and the files were generated on another machine with a different code page (e.g. data exported to IXF on a Windows machine and imported on a Linux machine), a local code page conversion takes place on the client first.
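What a receiver-side conversion does, and what goes wrong without one, can be sketched with Python's built-in codecs. This is an illustration under assumptions, not DB2's implementation: the helper name convert is hypothetical, the code pages cp1252, ISO-8859-15, ISO-8859-1 and UTF-8 are picked only as examples, and DB2 uses its own conversion tables.

```python
def convert(data: bytes, source_cp: str, target_cp: str) -> bytes:
    """Hypothetical helper: re-encode bytes from source_cp into target_cp."""
    return data.decode(source_cp).encode(target_cp)


text = "Preis: 5 €"

# Bytes as a client using Windows code page 1252 would send them.
sent = text.encode("cp1252")

# The receiver converts the data into its own code page.
as_utf8 = convert(sent, "cp1252", "utf-8")
as_latin9 = convert(sent, "cp1252", "iso8859_15")

# The euro sign has a different byte representation in each code page.
print(sent[-1:].hex(), as_latin9[-1:].hex(), as_utf8[-3:].hex())  # 80 a4 e282ac

# Without conversion, the receiver misreads the bytes in its own code page.
misread = sent.decode("iso8859_1")
print(misread == text)  # False

# A target code page without a euro sign cannot represent the data at all.
try:
    convert(sent, "cp1252", "iso8859_1")
    unmappable = False
except UnicodeEncodeError:
    unmappable = True
print(unmappable)  # True
```

In this sketch the ten bytes sent in cp1252 become twelve bytes in UTF-8, so a conversion can change the length of a string as well as its byte values.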
