Code Sets, NLS and Character Conversion Vs. DB2


Roland Schock, ARS Computer und Consulting GmbH
Session Code: C04
2014-09-10 | Platform: LUW
#IDUG

Overview
• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Character Sets
• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:
  A, B, C, ... ᇹ ぁ ゆ ㌹ ㌺ α γ π ξ 亹 怔 떟 떥

Character Encoding
• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points: A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• The encoding scheme also defines a serialisation scheme that converts each code point into a sequence of bytes.
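The mapping from character to code point to bytes can be tried out in a few lines of Python. This is only a sketch: the helper name `serialize` and the sample characters are assumptions made for illustration, and Python's codec `cp037` stands in for EBCDIC.

```python
def serialize(ch, encoding):
    """Return the bytes of one character in the given encoding as hex,
    or None if the encoding has no code point for that character."""
    try:
        return ch.encode(encoding).hex()
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for ch in ("A", "ä"):
        # ord() gives the Unicode code point of the character
        print(f"{ch!r} is U+{ord(ch):04X}")
        for encoding in ("ascii", "cp037", "utf-8"):
            print(f"    {encoding:6} -> {serialize(ch, encoding)}")
```

The same symbol ends up as different byte values depending on the code page, and a character outside the code page's character set (here "ä" in ASCII) cannot be encoded at all.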

ASCII
• Sample of an encoding scheme:
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)
• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
  • ISO-8859-1 (Latin 1): ASCII + Western European characters
  • ISO-8859-2 (Latin 2): ASCII + Eastern European characters
  • ISO-8859-15: modified ISO-8859-1 including the Euro symbol (€)
• Platform-specific charsets: Windows ANSI or MacRoman

Double Byte Char Sets (DBCS)
• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text takes twice the space it needs in an SBCS

EUC (Extended Unix Code)
• Multi Byte Char Set (MBCS): one to four bytes per character, depending on the variant
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single-shift characters to switch to another code group to build a multi-byte character

Unicode
• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition provided 65,536 code points (16-bit, 1991, UCS-2).
• Version 2.0 extended the charset with 16 additional planes for up to 1,114,112 code points (32-bit, 1996, UCS-4).
• As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.

Unicode char sets and encodings
• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4, UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.
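The size and byte-order differences between the Unicode encodings listed above can be checked with a short Python sketch. The four sample characters and the helper name `encoded_lengths` are assumptions chosen for illustration; the last sample lies outside the first 64k code points.

```python
# Sample characters: U+0041, U+00E4, U+20AC and U+1F600 (outside the first 64k)
SAMPLES = ["A", "\u00e4", "\u20ac", "\U0001F600"]


def encoded_lengths(ch):
    """Number of bytes one character needs in each Unicode encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}: {encoded_lengths(ch)}")

    # Byte order: the same code point, serialized big-endian and little-endian
    print("UTF-16BE:", "\u20ac".encode("utf-16-be").hex())
    print("UTF-16LE:", "\u20ac".encode("utf-16-le").hex())
```

UTF-8 has no byte-order issue because it is defined as a sequence of single bytes, whereas the 16-bit and 32-bit forms must agree on the order of bytes within each word.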

UTF-8
• Encoding in a variable-length sequence of bytes
• Simple recognition of multi-byte characters
• Compact storage of text in Latin characters
• Only the shortest possible encoding of a character is allowed

Usage of a code page
• Code pages can be specified at different levels:
  • At the operating system where the application runs
  • At the operating system where the server runs
  • At the operating system where the application is prepared/bound
  • At the database level

Default code page
• By default, DB2 server and clients use the local settings of the operating system or user:
  • Windows: The server process uses the default region settings of the operating system.
  • Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
  • Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
  • Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level
• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command

At prepare/bind time
• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and the locale setting in effect at connect time is used.
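To see which encoding the operating system reports for the current user, and therefore what a process started in this session would inherit, a small Python sketch can help. It only shows what the Python runtime sees; how the DB2 client derives its code page from the locale is not reproduced here, and the function name `describe_code_page` is an assumption.

```python
import locale
import sys


def describe_code_page():
    """Collect the encodings the runtime derives from the current OS settings."""
    return {
        # Encoding implied by the user's locale (e.g. LANG / LC_ALL on Linux/Unix)
        "locale_encoding": locale.getpreferredencoding(False),
        # Encoding used for names in the file system
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding of the console or pipe attached to standard output, if any
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_code_page().items():
        print(f"{name}: {value}")
```

Running it once in the shell used for prepare/bind and once in the environment of the application is a quick way to spot differing locale settings.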

Defining a database w/ code page
• Explicitly set the code page at creation time:
    CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).

Code page conversion
• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done on the receiver's side:
  • on the server's side for data sent from client to server
  • on the client's side for data sent from server to client
• Exception: importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332

Client to server conversion
[Diagram]

Using DB2 Connect
[Diagram]

Other considerations
• Mapping of characters (injective): If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective): If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.

More considerations
• Using different conversion tables and the € symbol: The Microsoft ANSI code page and the official code page 850 use different code points for the Euro symbol. If needed, code conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support: DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
• For pureXML (V9.x) a UTF-8 database is needed.

More considerations
• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.

Troubleshooting
• Identify the code pages in use:
  • db2 get db cfg for sample
    Retrieves the database code page.
  • Displaying the SQLCA during CONNECT with CLP
    When connecting to a database via CLP, the option "-a" displays the SQLCA data area, which shows the code page of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.

Pitfalls
• Watch out for unintentional "conversions".
• All database communication partners are configured correctly, but the DBA looks at the data through a console window, and the console window (or PuTTY) uses a font with the wrong code page to display the data!

db2set DB2CODEPAGE
• Know what you intend to do if you use the DB2 environment variable DB2CODEPAGE.
• It tells DB2 that you will feed it the right code points regardless of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion":
  SQL0191N Error occurred because of a fragmented MBCS character.
  http://www.ibm.com/support/docview.wss?uid=swg21601028

db2set DB2CONSOLECP
• Intended to allow DB2 CLI to use different code pages for output.
• Multiple APARs for DB2 9.1, 9.5, 9.7: "DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"

DB2 Special Registers for NLS
• Change the message text for DB2 MONREPORT modules:
    db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
    db2 "call monreport.lockwait"
• Change the language of day and month names:
    db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
    db2 "values monthname(current date)"
  (Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP, TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)

Performance considerations
• Try to avoid unnecessary conversions.
• Create databases with the code page your applications need from the start.
• For international databases prefer UTF-8, especially when used with Java programs.
• Remember: conversion takes time.

Links
• IBM developerWorks white paper: http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
• DB2 Infocenter: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp
• Unicode: http://www.unicode.org
• UTF-8 article at Wikipedia: http://en.wikipedia.org/wiki/UTF-8

Roland Schock
ARS Computer und Consulting GmbH
C04 Code sets, NLS and character conversion vs. DB2
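
Worked example: substitution, round trips and changing byte counts
The effects described under "Other considerations" can be reproduced outside the database with Python's built-in codecs. This is a sketch under stated assumptions: the helper names `convert` and `round_trip_is_lossless` are hypothetical, "?" stands in for the substitution character, and Python's `cp850` table (which contains no Euro sign) is used purely to force a substitution. It does not reproduce DB2's own conversion tables.

```python
def convert(data, source, target):
    """Re-encode bytes from the source code page into the target code page.
    Characters missing in the target are replaced by '?'."""
    return data.decode(source).encode(target, errors="replace")


def round_trip_is_lossless(text, source, target):
    """True if source -> target -> source returns the original bytes."""
    original = text.encode(source)
    there = convert(original, source, target)
    back = convert(there, target, source)
    return back == original


if __name__ == "__main__":
    # All characters exist in both code pages: nothing is lost
    print(round_trip_is_lossless("Grüße", "cp1252", "cp850"))

    # The Euro sign has no counterpart in Python's cp850 table: it is substituted
    print(convert("5 €".encode("cp1252"), "cp1252", "cp850"))
    print(round_trip_is_lossless("5 €", "cp1252", "cp850"))

    # Conversion can change the storage size: 5 bytes in cp1252, 7 in UTF-8
    print(len("Grüße".encode("cp1252")), len(convert("Grüße".encode("cp1252"), "cp1252", "utf-8")))
```

The last line is the reason why a column sized for a single-byte code page may be too short once the same data is stored in a UTF-8 database.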