Paper 4645-2020

Turn Yourself into a SAS® Internationalization Detective in One Day: A Macro for Analyzing SAS Data of Any Encoding

Houliang Li, HL SASBIPROS INC

ABSTRACT

With the growing popularity of Unicode in SAS®, it is more and more likely that you will be handed SAS data sets encoded in UTF-8. Or at least they claim those are genuine UTF-8 SAS data. How can you tell? What happens if you are actually required to generate output in plain old ASCII encoding from data sources of whatever encoding? This paper provides a solution that gives you all the details about the source data regarding potential multibyte Unicode and other non-ASCII characters, in addition to ASCII control characters. Armed with the report, you can go back to your data provider and ask them to fix the problems they did not realize they have been causing you. If that is not feasible, you can at least run your production jobs successfully against cleaned data after removing all the exotic characters.

INTRODUCTION

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems, according to Wikipedia. It concerns character variables and their values only. UTF-8, or utf-8, is the most popular implementation of Unicode. When you install SAS here in North America, whether on a PC or server or in a multitier deployment, UTF-8 has been one of the three encodings installed by default for many years, the other two being Wlatin1 / Latin1 (English) and Shift-JIS (DBCS). Of course, you are free to install additional encodings for other languages such as French, German, Russian, and Chinese if you feel you may need to process data or generate output encoded in those languages.

With the introduction of SAS Viya in recent years, UTF-8 became the default encoding, period! Although more current versions of SAS Viya do give you the option to change to one of several encodings at the system level, being the default encoding says a lot about the importance of UTF-8. Don't be surprised if all SAS data sets from SAS Viya come in UTF-8 encoding. Say goodbye to your favorite Wlatin1 encoding on Windows or Latin1 encoding on Linux / Unix. In reality, however, not many places are actively using SAS Viya at the moment, even though adoption is trending up. If you are somehow worried about this SAS internationalization thing (I18N for short, because there are 18 letters between the first letter I and the last letter N), that worry is probably premature for the foreseeable future. For most companies and government agencies in North America, data sources and output will still be in English, and the Wlatin1 or Latin1 encoding should continue to serve them well.

If you are new, or relatively new, to the I18N bandwagon or the concept of character encoding in general: encoding is literally the mapping of characters such as letters and symbols to numbers (called code points) on a character map (called a code page). The ASCII table is perhaps the most famous character map. Computers look smart and scary, but they are actually dumb. All they know and can handle is bits, i.e., binary digits, or 0s and 1s. It is just that they can process those bits incredibly fast. ASCII, or American Standard Code for Information Interchange, is one of the earliest computer encoding schemes. It initially used only 7 bits, i.e., from 000 0000 to 111 1111, or 0 to 127 in decimal form, which means it could represent a total of only 128 unique characters.
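You can inspect code points directly in SAS. The short step below is my own illustration, not part of the macro presented later: RANK maps a character to its numeric position in the session's collating sequence (its code point, in an ASCII-based encoding), and BYTE goes the other way.

/* A minimal sketch: inspecting ASCII code points in Base SAS.     */
/* RANK returns a character's position in the collating sequence,  */
/* i.e., its code point under an ASCII-based session encoding, and */
/* BYTE performs the reverse lookup.                                */
data _null_;
   s = rank('S');    /* 83: the code point of uppercase S   */
   b = byte(83);     /* 'S': the character at code point 83 */
   put s= b=;
   /* The same value rendered in binary and hexadecimal */
   put s binary8. ' = ' s hex2.;
run;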
The first 32 (values 0 to 31) are control characters and are usually unprintable or invisible, and the last one (value 127) is the "Delete" control character, also unprintable. The rest include uppercase and lowercase letters, decimal digits, some common symbols and punctuation marks, and the space. For example, the uppercase letter "S" is represented by the decimal number 83 (or 101 0011 in binary), which is its code point in ASCII. Similarly, the lowercase letter "s" has code point 115 (or 111 0011 in binary), and the space character, arguably the first printing or visible character in ASCII, occupies code point 32 (or 010 0000 in binary). Just Google "ASCII table" to see how your favorite letter or lucky number is represented in ASCII.

For the purpose of understanding this paper, we really only need to know about three more encodings: Wlatin1, Latin1, and UTF-8. ISO 8859-1, also known in SAS as Latin1, adds one more bit to ASCII, so it ranges from 0000 0000 to 1111 1111 in binary, or 0 to 255 in decimal. That means it can represent 256 unique characters. The first 128 are the same ASCII characters, and the additions contain 96 characters sufficient for the most common Western European languages. Wlatin1, or Windows Latin1, is officially known as Windows 1252, or Code Page 1252 or cp1252. It is very similar to Latin1, except that it has some additional symbols in the range 128 to 159, whereas Latin1 has nothing for those code points (hence only 96 additional characters out of 128 possible spots). Latin1 is the default encoding for SAS on Linux / Unix, while Wlatin1 is the default for Windows SAS. Characters in both encodings are exactly one byte in size. Table 1 below shows all the printable characters that are available in Wlatin1; Latin1 includes all of the same characters except those enclosed in the blue rectangle:

Table 1: The Wlatin1 Character Set

Just like Wlatin1 and Latin1, UTF-8 also has its first 128 characters identical to ASCII. However, that is where the similarity ends. While both Wlatin1 and Latin1 use 8 bits (or one byte) and can represent up to 256 characters (remember, Latin1 does not use all the available code points), UTF-8 uses 8 bits / one byte for the 128 ASCII characters only. It does not use the remaining 128 single-byte values as standalone characters at all! What makes UTF-8 so powerful is that it can use 16, 24, even 32 bits (or two, three, even four bytes!) to represent all the other characters in the world. It is an MBCS (Multi Byte Character Set). All those characters in Wlatin1 and Latin1 beyond ASCII have their corresponding representations in UTF-8 as either two-byte or three-byte sequences. In addition, UTF-8 has a unique bit pattern for each of those multibyte representations. The first byte always begins with bits 110, 1110, or 11110, indicating two, three, or four bytes in total for the character, and every remaining byte starts with bits 10 to indicate that it continues the same character. After removing those indicator bits, a 16-bit / two-byte UTF-8 representation has 11 bits left (110X XXXX 10XX XXXX) to represent 2**11 = 2048 characters. The first 128 code points overlap ASCII as represented by one-byte UTF-8, so the net increase is actually 2048 - 128 = 1920 characters. Similarly, a 24-bit / three-byte UTF-8 encoding can represent a total of 2**16 = 65536 characters (again, notice the overlap with two-byte UTF-8), and a 32-bit / four-byte UTF-8 representation can encode 2**21, or 2,097,152 characters.
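Those fixed bit patterns mean every byte of a suspect string can be classified mechanically, which is exactly the kind of detective work this paper automates. The DATA step below is a minimal sketch of the idea; the test string, built from hex literals, is my own. It walks the string byte by byte with SUBSTR and RANK and tests the leading bits of each byte with the BAND (bitwise AND) function.

/* A minimal sketch of UTF-8 byte classification. The leading bits */
/* of each byte tell us whether it starts a 1-, 2-, 3-, or 4-byte  */
/* character or merely continues one.                               */
data _null_;
   /* 'S' (53), then E2 89 A4 (a 3-byte character), then the       */
   /* smiley face F0 9F 98 83 (a 4-byte character)                 */
   s = '53'x || 'E289A4'x || 'F09F9883'x;
   do i = 1 to length(s);
      b = rank(substr(s, i, 1));            /* byte value, 0 to 255 */
      if band(b, 128) = 0 then type = '1-byte (ASCII)';
      else if band(b, 224) = 192 then type = '2-byte lead';
      else if band(b, 240) = 224 then type = '3-byte lead';
      else if band(b, 248) = 240 then type = '4-byte lead';
      else if band(b, 192) = 128 then type = 'continuation';
      else type = 'invalid UTF-8';
      put i= b hex2. +1 type;
   end;
run;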
Clearly, two million+ is sufficient to represent all written human languages, dead or alive, plus a lot of other symbols such as emojis. So far, Unicode has actually defined fewer than 1.2 million code points. There are simply not enough alphabets and logograms and other symbols in the world to encode! I know you all like this emoji: 11110000:10011111:10011000:10000011 (hexadecimal equivalent: F0 9F 98 83), which is the smiley face. But do you also like this emoji? 11110000:10011111:10010010:10101001 (hexadecimal equivalent: F0 9F 92 A9). Maybe you do and have even used it too, but you need to figure it out first!

Just in case you are not so familiar with hexadecimal representations: hexadecimal is a more efficient (read: shorter) way of writing binary values. Each hexadecimal (hex for short) digit ranges from 0 to 15, with the 16 values denoted by decimal digits 0 through 9 and then letters A through F (or a through f, since case does not matter), i.e., A = 10, B = 11, and so on up to F = 15. Each hex digit corresponds to exactly four binary digits. For example, the letter "S" is 0101 0011 in binary, which adds up to 2**6 + 2**4 + 2**1 + 2**0 = 83 in decimal. Its hex equivalent is 53, which adds up to 5 * 16**1 + 3 = 83 in decimal. We know its code point in Wlatin1 / Latin1 / UTF-8 is 83. Everything matches perfectly.

To really bring the concept home, let's examine a multibyte UTF-8 character. Given the following binary representation in UTF-8 (please note: the colons between bytes are for easy identification only and are not required): 11100010:10001001:10100100, or its hex equivalent E2 89 A4, our task is to figure out its code point in Unicode. We first remove the indicator bits, namely the leading 1110 of the first byte and the leading 10 of each continuation byte, because they only describe the structure of the sequence. Next, we assemble ALL the remaining payload bits in the same order as listed and treat the resultant binary string as a single number, which reveals the code point for the character: 0010 001001 100100 = 10 0010 0110 0100 = 2**13 + 2**9 + 2**6 + 2**5 + 2**2 = 8192 + 512 + 64 + 32 + 4 = 8804. Code point 8804, or 2264 in hex, is U+2264: the less-than-or-equal sign (≤).
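The same bit surgery is easy to mechanize. The DATA step below is my own sketch (not the macro this paper presents): it uses RANK to get each byte's numeric value, BAND to mask off the indicator bits, and multiplication to shift the payload bits into place for the three-byte sequence E2 89 A4.

/* A minimal sketch: decode a 3-byte UTF-8 sequence to its code point. */
data _null_;
   b1 = rank(substr('E289A4'x, 1, 1));   /* 1110xxxx lead byte    */
   b2 = rank(substr('E289A4'x, 2, 1));   /* 10xxxxxx continuation */
   b3 = rank(substr('E289A4'x, 3, 1));   /* 10xxxxxx continuation */
   /* Keep the payload bits (masks 0F and 3F) and shift them into
      place: 4 bits from b1, 6 from b2, 6 from b3 = 16 bits total */
   codepoint = band(b1, 15) * 4096       /* shift left by 2**12 */
             + band(b2, 63) * 64         /* shift left by 2**6  */
             + band(b3, 63);
   put codepoint=;                       /* 8804, i.e., hex 2264 */
run;

The step prints 8804, matching the hand calculation above. In a UTF-8 SAS session, the UNICODE function offers a cross-check: unicode('\u2264') should return that same less-than-or-equal character.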