CONNECT TO COMMUNITY. At SunGard Summit, come together as a community dedicated to education.
UTF-8 data in Banner 8.0, now what?
Presented by: Arnold J. Smith III, Virginia Tech
March 23, 2009
Course ID 0996
Session Rules of Etiquette
• Please turn off your cell phone/pager
• If you must leave the session early, please do so as discreetly as possible
• Please avoid side conversation during the session
Thank you for your cooperation!
Introduction
• Share information about UTF-8
  —The concept of “speaking” UTF-8
  —Testing and test data
  —Considerations beyond the database
  —Configuration information
• Questions and Answers
Agenda
• Does your software “speak” UTF-8?
  —“Speak” UTF-8? What do you mean?
  —Single versus multiple bytes
  —Test data
• UTF-8 and the outside world
  —Interfaces and UTF-8
  —Some interfaces will not support Unicode
• Additional UTF-8 information
  —Tips and tricks
  —Various configuration options explained
• API user exits at Virginia Tech
Does your software “speak” UTF-8?
Speak “UTF-8”? What do you mean?
• What do the letters “Summit” mean?
  —A meeting/retreat typically involving the highest-level decision makers within a given domain
  —summit -> gipfel -> cumbre
• What is “a”?
  —A number we have all agreed represents the character “a”
  —Unicode notation U+0061

Encoding      Decimal  Hex
ASCII         97       61
Windows-1252  97       61
UTF-8         97       61
EBCDIC        129      81
UTF-16        00 97    00 61
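The encoding table for “a” can be checked with a short Python sketch. Python and its codec names are assumptions made for illustration and were not part of the session; in particular, `cp500` is just one EBCDIC code page, and `utf-16-be` is big-endian UTF-16 without a byte order mark.

```python
# Encode the character "a" (U+0061) under several encodings.
# Codec names are Python's; "cp500" is one EBCDIC code page chosen for illustration.
ENCODINGS = ["ascii", "cp1252", "utf-8", "cp500", "utf-16-be"]

def encoded_forms(char):
    """Return {codec: (decimal byte values, hex byte values)} for one character."""
    forms = {}
    for codec in ENCODINGS:
        raw = char.encode(codec)
        forms[codec] = (list(raw), [format(b, "02X") for b in raw])
    return forms

for codec, (dec, hexa) in encoded_forms("a").items():
    print(f"{codec:10} decimal={dec} hex={hexa}")
```

Only the EBCDIC and UTF-16 rows differ from ASCII, which is why plain English text rarely exposes encoding problems.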
UTF-8 encoding
• Standard characters, symbols and control codes
  —Use 1 byte
  —Standard keyboard
  —Decimal value < 128
  —Hex value < 80
  —Identical to ASCII, Windows-1252, ISO-8859-1
• Extended characters, symbols, etc.
  —Use 2 to 4 bytes
  —Supports multiple languages and symbol sets

Character  Windows-1252  UTF-8     Unicode
ä          E4            C3 A4     U+00E4
™          99            E2 84 A2  U+2122
⌁          (none)        E2 8C 81  U+2301
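The byte values in the table above can be reproduced with a small Python sketch. The helper name `byte_forms` is hypothetical, and `cp1252` is Python's codec name for Windows-1252.

```python
# Compare the bytes a character needs in Windows-1252 and in UTF-8.
def byte_forms(char):
    """Return (Windows-1252 hex or None, UTF-8 hex, Unicode notation)."""
    try:
        cp1252_hex = char.encode("cp1252").hex(" ").upper()
    except UnicodeEncodeError:
        cp1252_hex = None  # the character does not exist in Windows-1252
    utf8_hex = char.encode("utf-8").hex(" ").upper()
    return cp1252_hex, utf8_hex, f"U+{ord(char):04X}"

for ch in ["a", "\u00e4", "\u2122", "\u2301"]:
    print(repr(ch), byte_forms(ch))
```

Characters below U+0080 keep a single byte, while U+2301 needs three bytes in UTF-8 and has no Windows-1252 form at all.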
Single versus multiple bytes
• “Tämpico™” might appear as “TÃ¤mpicoâ„¢”

Character  UTF-8 Hex  Windows-1252
T          54         T
ä          C3 A4      Ã¤
m          6D         m
p          70         p
i          69         i
c          63         c
o          6F         o
™          E2 84 A2   â„¢

• Possible Reasons
  —Software not configured for UTF-8
  —Software doesn’t support UTF-8
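The garbling described above is what happens when UTF-8 bytes are decoded as Windows-1252. A minimal Python sketch follows; the function names are hypothetical, and the repair step only works while every byte of the garbled text still maps back to Windows-1252.

```python
def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

def repair(garbled):
    """Reverse the mistake: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")

garbled = misread_as_cp1252("Tämpico™")
print(garbled)          # each multi-byte character shows up as 2 or 3 characters
print(repair(garbled))  # the original text, if no bytes were lost on the way
```

The plain letters survive untouched because their UTF-8 and Windows-1252 bytes are identical; only the multi-byte characters expand.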
Practical Example
• sqlplus on database server
  —Sun Solaris 5.10
  —Database converted to AL32UTF8
  —NLS_LANG=AMERICAN_AMERICA.AL32UTF8
  —LANG=en_US.UTF-8
• The name “Moët” appears as “MoÃ«t”
  —SSH client did not support UTF-8
  —Use PuTTY with UTF-8 enabled
Test Data
• Quality of test data is critical
  —Configuration issues
  —Support for UTF-8
• What constitutes good test data
  —Multi-byte characters
  —Maximum field length
  —Asian characters
  —Esoteric characters
• The web plus copy and paste is your friend
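As a rough illustration of the test-data criteria above, the following Python sketch builds values that fill a field to its maximum length with characters of 1 to 4 UTF-8 bytes. The sample characters, the helper names, and the 60-character field length in the usage line are all assumptions chosen for the example, not values from Banner.

```python
# Sample characters covering 1-, 2-, 3-, and 4-byte UTF-8 sequences.
SAMPLES = {
    "ascii": "a",              # 1 byte
    "latin": "\u00e4",         # 2 bytes (a with diaeresis)
    "asian": "\u6f22",         # 3 bytes (a CJK ideograph)
    "esoteric": "\U0001D11E",  # 4 bytes (musical symbol G clef)
}

def make_test_value(char, max_chars):
    """Fill a field to its maximum length in characters with one sample character."""
    return char * max_chars

def describe(value):
    """Return (character count, UTF-8 byte count) for a test value."""
    return len(value), len(value.encode("utf-8"))

for label, char in SAMPLES.items():
    print(label, describe(make_test_value(char, 60)))
```

A value that is full in characters can need up to four times as many bytes, which is the case most likely to expose byte-based length limits.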
UTF-8 and the outside world
Interfaces and UTF-8
• Interface defined
  —An interface defines the communication boundary between two entities, such as a piece of software, a hardware device, or a user.
• Interfaces and character encoding
  —Both entities must agree on character encoding
  —Some interfaces specify encoding
    • XML files
    • HTTP protocol
  —Most interfaces assume a default encoding
• Some interfaces exist internal to a single machine
Multiple interface example
• SSH Client
  —SSH Client Configuration
• sqlplus
  —LANG
  —NLS_LANG
• Oracle
  —AL32UTF8
Interface thoughts
• Interfaces that “just worked”
  —Internet Native Banner
  —Banner Self-Service
• Interfaces impacted by UTF-8
  —Desktop query tools
  —Reports
  —Non-Banner applications
  —Other entities on campus
  —Outside agencies (IRS, bank, etc.)
  —Outsourced systems
• Lessons Learned
  —Character encoding issues existed before UTF-8
  —Need for better data flow documentation
Some interfaces will not support Unicode
• Are you really going to encounter UTF-8 data?
• Possible Solutions
  —Set NLS_LANG to a value such as
    • AMERICAN_AMERICA.WE8MSWIN1252
    • AMERICAN_AMERICA.US7ASCII
  —Use a conversion tool such as iconv
• Possible data loss
  —Is the use of substitution characters acceptable?
  —Will the data be fed back into Banner?
• Possible data corruption
  —Does the data contain “keys” that will be used to access or update information in Banner?
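The data loss and key corruption risks above can be illustrated with a short Python sketch. The function name `to_legacy` is hypothetical, and the `?` substitution is Python's own behavior with `errors="replace"`; other conversion tools may substitute differently or refuse to convert.

```python
def to_legacy(text, charset):
    """Convert text for an interface that only accepts `charset`.

    Characters the charset cannot represent become "?" (Python's
    substitution character). Returns (converted text, True if anything was lost).
    """
    converted = text.encode(charset, errors="replace").decode(charset)
    return converted, converted != text

# Two different names collapse to the same value once substituted.
print(to_legacy("Moët", "ascii"))
print(to_legacy("Moèt", "ascii"))
print(to_legacy("Moët", "cp1252"))
```

If the converted value is later used as a key to look up or update a record, two distinct originals that collapse to the same substituted string can no longer be told apart.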
Additional UTF-8 information
Query Tools
• Fully support AL32UTF8
  —Microsoft Access 2007
  —Oracle SQL Developer 1.5
• Requires NLS_LANG set to WE8MSWIN1252
  —Quest Software SQL Navigator 5.5
• Old tools that will not connect to an AL32UTF8 database
  —Oracle Data Browser (Oracle Developer/2000)
  —Oracle Query Builder (Oracle Developer 6i)
UTF-8 files on Windows XP
• Multiple Tools
  —Notepad
  —WordPad
  —Microsoft Office Word 2007
  —OpenOffice.org Writer
• Encoding
  —Unicode (UTF-8)
BYTE versus CHAR semantics
• BYTE semantics
  —Specify the # of bytes
  —VARCHAR2(3 BYTE)
• CHAR semantics
  —Specify the # of characters
  —VARCHAR2(3 CHAR)
• Character Semantics Default
  —NLS_LENGTH_SEMANTICS
  —Session or instance level
  —Controls how VARCHAR2(3) is interpreted
• Oracle Article
  —Globalize with Character Semantics
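The difference between the two semantics can be sketched in Python as a simple length check. This is only a model of the column limit in an AL32UTF8 database, with a hypothetical helper name; it does not call Oracle.

```python
def fits(value, limit, semantics):
    """Check a value against VARCHAR2(limit BYTE) or VARCHAR2(limit CHAR),
    assuming the database character set is AL32UTF8 (UTF-8)."""
    if semantics == "BYTE":
        return len(value.encode("utf-8")) <= limit
    if semantics == "CHAR":
        return len(value) <= limit
    raise ValueError("semantics must be 'BYTE' or 'CHAR'")

print(fits("Moët", 4, "CHAR"))  # four characters
print(fits("Moët", 4, "BYTE"))  # five bytes in UTF-8
```

The same four-character name passes under CHAR semantics and fails under BYTE semantics, which is why the default chosen through NLS_LENGTH_SEMANTICS matters.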
NLS_LANG
• language_territory.charset
  —AMERICAN_AMERICA.AL32UTF8
  —AMERICAN_AMERICA.WE8MSWIN1252
  —AMERICAN_AMERICA.WE8ISO8859P1
  —AMERICAN_AMERICA.US7ASCII
• Oracle NLS_LANG FAQ

Database NLS_CHARACTERSET  Client NLS_LANG charset  Substitution Character
AL32UTF8                   WE8ISO8859P1             ¿
AL32UTF8                   WE8MSWIN1252             ¿
AL32UTF8                   US7ASCII                 ?
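To make the `language_territory.charset` layout concrete, here is a minimal Python sketch that splits an NLS_LANG value into its three parts. The `NlsLang` type and `parse_nls_lang` function are hypothetical helpers, and the sketch assumes the full three-part form is present.

```python
from typing import NamedTuple

class NlsLang(NamedTuple):
    language: str
    territory: str
    charset: str

def parse_nls_lang(value):
    """Split an NLS_LANG value of the form language_territory.charset."""
    locale_part, _, charset = value.partition(".")
    language, _, territory = locale_part.partition("_")
    return NlsLang(language, territory, charset)

setting = parse_nls_lang("AMERICAN_AMERICA.WE8MSWIN1252")
print(setting.charset)
```

Only the charset part decides whether the client receives UTF-8 or a converted, possibly substituted, form of the data.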
LANG and locale (Solaris)
• LANG environment variable
  —Controls the character set used by the Unicode libraries employed by Banner 8.0 Pro*C
  —Adapts Solaris to a specific geographic market and corresponding character set
• Determine current locale using the locale command
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_ALL=
• Solaris locale FAQ
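Output in the format shown by the locale command above can be checked mechanically. The Python sketch below is an illustration only: the function name is hypothetical, and it simply flags any setting whose value is present but does not end in a UTF-8 codeset name.

```python
def non_utf8_categories(locale_output):
    """Return the names of locale settings whose value is set but not UTF-8."""
    offenders = []
    for line in locale_output.splitlines():
        name, sep, value = line.partition("=")
        value = value.strip().strip('"')
        if not sep or not value:
            continue  # skip blank lines and unset entries such as LC_ALL=
        if not value.upper().endswith(("UTF-8", "UTF8")):
            offenders.append(name.strip())
    return offenders

sample = 'LANG=en_US.UTF-8\nLC_CTYPE="en_US.UTF-8"\nLC_ALL='
print(non_utf8_categories(sample))
```

An empty result means every category that is set uses a UTF-8 locale.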
Identify data in Oracle
• SPRIDEN names with characters beyond ASCII

    -- Round-trip each name through US7ASCII; a name that changes
    -- contains at least one character outside ASCII.
    SELECT *
      FROM SPRIDEN
     WHERE SPRIDEN_CHANGE_IND IS NULL
       AND (   CONVERT(CONVERT(SPRIDEN_LAST_NAME, 'US7ASCII'),
                       'AL32UTF8', 'US7ASCII') != SPRIDEN_LAST_NAME
            OR CONVERT(CONVERT(SPRIDEN_FIRST_NAME, 'US7ASCII'),
                       'AL32UTF8', 'US7ASCII') != SPRIDEN_FIRST_NAME
            OR CONVERT(CONVERT(SPRIDEN_MI, 'US7ASCII'),
                       'AL32UTF8', 'US7ASCII') != SPRIDEN_MI )
Identify data in Oracle
• SPRIDEN names with characters beyond Windows-1252

    -- Round-trip each name through WE8MSWIN1252; a name that changes
    -- contains at least one character outside Windows-1252.
    SELECT *
      FROM SPRIDEN
     WHERE SPRIDEN_CHANGE_IND IS NULL
       AND (   CONVERT(CONVERT(SPRIDEN_LAST_NAME, 'WE8MSWIN1252'),
                       'AL32UTF8', 'WE8MSWIN1252') != SPRIDEN_LAST_NAME
            OR CONVERT(CONVERT(SPRIDEN_FIRST_NAME, 'WE8MSWIN1252'),
                       'AL32UTF8', 'WE8MSWIN1252') != SPRIDEN_FIRST_NAME
            OR CONVERT(CONVERT(SPRIDEN_MI, 'WE8MSWIN1252'),
                       'AL32UTF8', 'WE8MSWIN1252') != SPRIDEN_MI )
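The round-trip idea behind the nested CONVERT calls can be tried outside the database. The Python sketch below mirrors the technique on a list of names; the function names are hypothetical, and Python's `ascii` and `cp1252` codecs stand in for US7ASCII and WE8MSWIN1252.

```python
def survives_round_trip(value, charset):
    """Convert to `charset` and back, then compare with the original,
    in the same spirit as the nested CONVERT calls."""
    round_tripped = value.encode(charset, errors="replace").decode(charset)
    return round_tripped == value

def names_beyond(names, charset):
    """Return the names containing characters outside `charset`."""
    return [n for n in names if not survives_round_trip(n, charset)]

names = ["Smith", "Moët", "Tämpico™", "漢字"]
print(names_beyond(names, "ascii"))
print(names_beyond(names, "cp1252"))
```

Running the same check against both character sets separates names that merely need Windows-1252 from names that need full Unicode support.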
Hyperion SQR 8.5 – SQR.INI
[Default-Settings]
Default-Numeric = V30
NewGraphics = True
AutoDetectUnicodeFiles = FALSE
UseUnicodeInternal = FALSE
Output-File-Mode = Short
OutputTwoDigitYearWarningMsg = FALSE
Hyperion SQR 8.5 – SQR.INI
[Environment:Common]
Encoding = UTF-8
Encoding-SQR-Source = ASCII
Encoding-File-Input = UTF-8
Encoding-File-Output = UTF-8
Encoding-Console = UTF-8
Encoding-Database = UTF-8
Encoding-Report-Output = ASCII
Hyperion SQR 8.5 – write statement
• length uses byte semantics
  —Example
      write 1 from &LNAME:30 '^':1 &FNAME:30
  —Workaround
      let $LNAME = substr(&LNAME,1,30)
      let $FNAME = substr(&FNAME,1,30)
      write 1 from $LNAME '^' $FNAME
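Why a byte-based length matters when writing fields can be shown with a Python sketch. This models the general effect of byte semantics on UTF-8 data and is not a reproduction of SQR's exact behavior; both function names are hypothetical.

```python
def field_by_bytes(value, width):
    """Truncate and pad to `width` UTF-8 bytes (byte semantics).
    A character split by the cut is dropped and replaced by padding."""
    raw = value.encode("utf-8")[:width]
    text = raw.decode("utf-8", errors="ignore")
    return text + " " * (width - len(text.encode("utf-8")))

def field_by_chars(value, width):
    """Truncate and pad to `width` characters (character semantics)."""
    return value[:width].ljust(width)

print(repr(field_by_bytes("Moët", 4)))
print(repr(field_by_chars("Moët", 4)))
```

Under byte semantics the four-character name loses its last letter in a width of 4, while a character-based cut keeps it whole.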
Hyperion SQR 8.5 – read statement
• length uses byte semantics
  —Example
      read 1 into $LNAME:30 $FNAME:30
  —Workaround
      read 1 into $RECORD
      let $LNAME = substr($RECORD,1,30)
      let $FNAME = substr($RECORD,31,30)
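The read workaround above amounts to reading the whole record first and then slicing it by character position. A minimal Python sketch of the same idea, with a hypothetical `parse_record` helper and two 30-character fields assumed for the example:

```python
def parse_record(raw, widths):
    """Decode a UTF-8 record, then slice it into fields by character count."""
    record = raw.decode("utf-8")
    fields, start = [], 0
    for width in widths:
        fields.append(record[start:start + width].rstrip())
        start += width
    return fields

# A record with two 30-character fields; the first holds a 2-byte character.
raw = ("Moët".ljust(30) + "Zoë".ljust(30)).encode("utf-8")
print(parse_record(raw, [30, 30]))
print(len(raw))  # more than 60 bytes, so byte offsets would not line up
```

Because the first field occupies 31 bytes, slicing the raw bytes at offset 30 would start the second field one position too early.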
API user exits at Virginia Tech
• We do not officially support the full spectrum of UTF-8
• Wanted to limit critical fields to Windows-1252
• Added user exits to the following APIs
  —FB_INVOICE_HEADER
  —GB_ADDRESS
  —GB_IDENTIFICATION
  —GB_BIO
Summary
• Multi-byte nature of UTF-8
• Importance of quality test data
• Interfaces and character set encoding
  —Do you know how data flows in and out?
• Level of support for UTF-8
Questions & Answers
Thank You! Arnold J. Smith III
Please complete the online class evaluation form Course ID 0996
SunGard, the SunGard logo, Banner, Campus Pipeline, Luminis, PowerCAMPUS, Matrix, and Plus are trademarks or registered trademarks of SunGard Data Systems Inc. or its subsidiaries in the U.S. and other countries. Third-party names and marks referenced herein are trademarks or registered trademarks of their respective owners.
© 2009 SunGard. All rights reserved.