SAP Migrations to Unicode
Sebastian Buhlinger
SAP Consultant, HP-SAP EMEA CC
3/31/2004
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Agenda
1. Introduction to Unicode
2. Unicode & SAP in General
3. Technology in Depth

1. Introduction to Unicode
• History of character encoding
• Problem of character encoding
• From ASCII to Unicode
• What is Unicode exactly?
• Unicode Encodings

History of Character Encoding
• Historically, computers were slow, had little memory, and were very expensive
• Up to the 1960s, I/O meant punching holes into paper tape
• Most character sets date back to the punch-card age and were designed with these cards in mind
• In the early days of computing, every hardware manufacturer used proprietary technology (and encodings)
• International data interchange was not an issue, so nothing needed to fit together

Problem of Character Encoding
• Which number is assigned to which character?
• When the user types an ‘A’ on the keyboard, the computer uses the character code to look up the shape of ‘A’ under the same binary number in a font file, and then displays or prints it
• The character ‘A’ may also have different integer values in different programs or data files (the code for ‘A’ might select an entirely different glyph in an Arabic font file)
• In some instances no number is available for certain characters (e.g. “ä” → “Ä”)
• All data is encoded in the form of binary numerical codes

Character Repertoire
• English alphabet, with digits and a little more: ~60 characters
• Western European standard: ~300 characters for several languages
• Korean: ~12,000 syllables
• Chinese dictionaries: ~50,000 characters
• Hundreds of other characters in common use, such as math and currency symbols

From ASCII to Unicode
• Most character sets and encodings of the 1970s and 1980s were modifications or extensions of ASCII
• Many of them used 8 bits and kept a subset of the 94 printable ASCII characters
• The most common encodings today use a single byte per character (SBCS)
• They are all limited to 256 characters
• As a result, none of them can cover even the letters of all Western European languages

What is Unicode exactly?
• Unicode = universally encoded character set for storing information from any language
• Unicode
  • defines properties for each character
  • standardizes script behavior
  • provides a standard algorithm for bidirectional text
  • defines cross-mappings to other standards
• Unicode defines a unique code value for every character, regardless of the platform, program, or programming language used

• The Unicode standard primarily encodes scripts rather than languages
• A script comprises several languages that historically share the same set of symbols
• In many cases a script serves to write dozens of languages (e.g. the Latin script)
• In other cases one script corresponds to a single language (e.g. Hangul for Korean)

Unicode Encodings
• UTF = Unicode Transformation Format
• UCS = Universal Character Set
• CESU = Compatibility Encoding Scheme for UTF-16
• Conversion between the different encodings is a simple bit-wise operation (defined in the standard)
• No performance-intensive conversion tables are necessary!
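The statement that conversion needs only bit-wise operations can be illustrated with a minimal Python sketch. The helper names `utf8_encode` and `utf16_units` are hypothetical and not part of the presentation; they derive the UTF-8 bytes and the UTF-16 code units of a code point using shifts and masks only, with no lookup table.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using only shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx + three trailing bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (one, or a surrogate pair) for a code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                    # 20 bits, split into two 10-bit halves
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


if __name__ == "__main__":
    for ch in "Aö\U0001D11E":
        cp = ord(ch)
        print(f"U+{cp:04X}", utf8_encode(cp).hex(" "),
              " ".join(f"{u:04X}" for u in utf16_units(cp)))
```

The results can be cross-checked against Python's built-in codecs, for example `"ö".encode("utf-8")`.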
Unicode Encodings
• UTF-8: Unicode Transformation Format based on an 8-bit representation
• CESU-8: Compatibility Encoding Scheme for UTF-16 on an 8-bit base
• UTF-16: Unicode Transformation Format based on a 16-bit representation
• UCS-2: Universal Character Set, 2-byte variant (16-bit)
• UTF-32: Unicode Transformation Format based on a 32-bit representation
• UCS-4: Universal Character Set, 4-byte variant (32-bit)

• Not all Unicode characters are 2 bytes long → hardware requirements do not automatically double
• The Unicode encoding determines the length of a character
• A character in a Unicode encoding can be longer than 1 byte; Unicode characters can therefore be longer than characters defined in a standard code page

Example #1
Character   UTF-8         UCS-2   UTF-16
A           41            0041    0041
c           63            0063    0063
Æ           C3 86         00C6    00C6
ö           C3 B6         00F6    00F6
٤           D9 A4         0664    0664
页          E9 A1 B5      9875    9875
𝄞           F0 9D 84 9E   N/A     D834 DD1E

Example #2 – character “가” (U+AC00)
UTF-8   HEX   E    A    B    0    8    0
        BIN   1110 1010 1011 0000 1000 0000
Lead byte indicator: 1110; trailing byte indicator: 10
Remove indicator bits:  1010 | 11 0000 | 00 0000
Regroup bits:           1010 1100 0000 0000
UTF-16  BIN   1010 1100 0000 0000
        HEX   A    C    0    0

2. Unicode & SAP in General
• Code Pages
• SAP & Code Pages
• Language Combinations before Unicode
• Recommendations from SAP (w/o Unicode)
• When/why do customers need Unicode?
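As a companion to Example #2 in part 1, the following Python sketch performs the same steps in code: it strips the lead and trailing byte indicators from a three-byte UTF-8 sequence and regroups the remaining 16 bits. The function name `decode_utf8_3byte` is a hypothetical helper, and the sketch deliberately handles only the three-byte case from the slide; it does not reject overlong or surrogate encodings as a full decoder would.

```python
def decode_utf8_3byte(data: bytes) -> int:
    """Regroup the payload bits of a 3-byte UTF-8 sequence into a code point."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    lead, trail1, trail2 = data
    # Lead byte must start with 1110, trailing bytes with 10.
    if lead >> 4 != 0b1110 or trail1 >> 6 != 0b10 or trail2 >> 6 != 0b10:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Drop the indicator bits: 4 + 6 + 6 payload bits remain.
    return ((lead & 0x0F) << 12) | ((trail1 & 0x3F) << 6) | (trail2 & 0x3F)


if __name__ == "__main__":
    cp = decode_utf8_3byte(bytes.fromhex("EAB080"))
    print(f"{cp:016b}", f"U+{cp:04X}")
```

For code points below U+10000 the resulting 16-bit value is also the single UTF-16 code unit, which is why the slide can read the UTF-16 form directly off the regrouped bits.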
Code Pages
• The code page determines which characters you can see and enter
[Figure: characters on disk / in memory]
• Different code pages map different characters to the same byte sequence
[Figure: single-byte and double-byte characters on disk / in memory]

SAP & Code Pages
[Figure]

Language Combinations before Unicode
• Single Standard Code Pages
  • support specific sets of languages
  • the number and combination of supported languages cannot be altered
• Standard code pages and R/3 languages (w/o EBCDIC)
[Table: standard code pages and R/3 languages, including double-byte code pages]
• It is also possible to specify a customer-specific language; this language must use one of the code pages that SAP supports; see Note 0112065

• Blended Code Pages (≥ Rel. 3.1D)
  • SAP proprietary code pages that contain characters from one or more standard code pages
  • increase the combinations of languages that can be used
  • functionally, a Blended Code Page system is a single code page system
  • users can see and enter all characters contained in the code page, regardless of their log-in language
[Table: SAP code page / supported languages]
• The availability of SAP blended code pages is platform dependent, because SAP blended locales need to be created for each platform
[Table: blended locale status per platform (x = available, -- = not available)]

• MDMP (≥ Rel. 3.1I): Multi-Display / Multi-Processing
  • allows dynamic code page switching on the application server
  • therefore permits any combination of standard code pages on one system
  • the log-on language determines the code page that is active for each user
• An MDMP system is recommended if:
  1. one or more additional code pages are required to add languages to your existing installation
  2. a blended code page cannot support the combination of languages you need for a new installation
• For example, an MDMP system with the code pages 1100 and 8000 allows German and Japanese users to log on to the same R/3 system in their respective languages
[Figure: example with front ends using code page 8000 (SJIS, Japan) and code page 1100 (ISO-1, Germany), application server, and DB server]
• Each user can access only one code page at a time: a user who logs in as a Japanese user cannot enter German characters, and German characters in the database will not be displayed correctly
[Figure: example with a Japanese user and a German user]

Please Note:
• It is possible for a user to log on in German and then manipulate the character set and font settings so that he or she can enter what appear to be Japanese characters; these characters will not be stored correctly in the database, and the data will be corrupt
• A user who wants to enter, for example, Japanese must log on in Japanese
• To ensure that no data corruption occurs, the following restrictions must be followed:
  • Global data must contain only 7-bit ASCII characters, which are present in all code pages
  • Users may use only the characters of their log-in language or 7-bit ASCII
  • Batch processes must be assigned the correct user ID and language
  • EBCDIC code pages are not supported

Recommendations from SAP (w/o Unicode)
• In general, using a single standard code page is the optimal decision for new installations and upgrades
• If additional languages or language combinations are needed, SAP recommends Unambiguous Blended Code Pages for new installations and MDMP for existing installations
• Unambiguous Blended Code Pages support only certain language combinations, so an MDMP setup may be the only possibility for new installations as well

Unicode-compliant SAP products
• All Unicode installations are currently carried out only with written permission from SAP, as customer projects together with SAP, except for new installations of R/3 Enterprise Extension Set 2.0

When/why do customers need Unicode?
• Global businesses require IT systems that support multilingual data without any restrictions → e.g. customers with one worldwide central SAP system
• Web interfaces open the door to a global customer base, and IT systems must consequently be able to support multiple local languages simultaneously
• With J2EE integration, mySAP components fully support web standards, and with Unicode they can now take full advantage of XML and Java
• Only Unicode makes it possible to seamlessly integrate heterogeneous SAP and non-SAP system landscapes → NetWeaver

3. Technology in Depth