Information, Characters, Unicode Information – Characters
Total Page:16
File Type:pdf, Size:1020Kb
Information, Characters, Unicode Information { Characters In modern computing, natural-language text is very important information. (\Number-crunching" is less important.) Characters of text are represented in several different ways and a known character encoding is necessary to exchange text information. For many years an important encoding standard for characters has been US ASCII{a 7-bit encoding. Since 7 does not divide 32, the ubiquitous word size of computers, 8-bit encodings are more common. Very common is ISO 8859-1 aka \Latin-1," and other 8-bit encodings of characters sets for languages other than English. Currently, a very large multi-lingual character repertoire known as Unicode is gaining importance. US ASCII (7-bit), or bottom half of Latin1 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SS SI DLE DC1 DC2 DC3 DC4 NAK SYN ETP CAN EM SUB ESC FS GS RS US !"# $ %&'()*+,-./ 0123456789:;<=>? @ABCDEFGHIJKLMNO PQRSTUVWXYZ[\]^_ `abcdefghijklmno pqrstuvwxyz{|}~ DEL Control Characters Notice that the first twos rows are filled with so-called control characters. This characters have no printable representation. Except for various standards for indicating lines of text, these characters have no use today. So, nearly one-quarter of the space available for representing characters is wasted. Of course, the space character does not have a printable representation (no ink is used to print a space), but it is extremely useful. Some US-ASCII Characters Each character has a unique bit pattern used to represent it (and a Unicode name as we shall see later). binary oct dec char Unicode 0000 1001 0011 9 HT U+0009 horizontal tabulation 0010 0000 0040 32 U+0020 space 0010 1110 0056 46 . U+002E full stop 0010 1111 0057 47 / U+002F solidus 0011 0000 0060 48 0 U+0030 digit zero 0011 0001 0061 49 1 U+0031 digit one Top half of Latin-1 With 8 bits 256 characters can be encoded. Latin-1 is twice as big as US-ASCII. The extra characters allow languages like Spanish and German to be written in Latin-1. XXX XXX BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 DCS PU1 PU2 SST CCH MS SPA EPA SOS XXX SCI CSI ST OSC PM APC ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « : SHY ® ¯ ° ± 2 3 ´ µ ¶ · ¸ 1 º » ¼ ½ ¾ ¿ A` A´ A^ A~ A¨ A˚ Æ C¸ E` E´ E^ E¨ I` I´ I^ I¨ Ð N~ O` O´ O^ O~ O¨ × Ø U` U´ U^ U¨ Y´ Þ ß `a ´a ^a ~a ¨a ˚a æ ¸c `e ´e ^e ¨e `ı ´ı ^ı ¨ı đ ~n `o ´o ^o ~o ¨o ÷ ø `u ´u ^u ¨u ´y þ ¨y Some Characters Here are some of the characters in Latin-1 not used in writing English. binary oct dec Unicode 1100 0011 0303 195 A~ U+00C3 latin capital letter a with tilde 1101 0111 0327 215 × U+00D7 multiplication sign 1101 1111 0337 223 ß U+00DF latin small letter sharp s 1110 1101 0355 237 ´ı U+00ED latin small letter i with acute 1111 1110 0376 254 þ U+00FE latin small letter thorn An 8-bit character set is a convenient size and so US ASCII is for the most part replaced by Latin-1 which supports some European languages. Microsoft's CP1252 is somewhat similar. The new ISO 8859-15 (Latin-9) nicknamed Latin-0 updates Latin-1 by replacing eight infrequently used characters ¤¦¨´¼½¾. with left-out French letters (¨y,œ) and Finnish and Lithuanian letters (ˇs,ˇz),and placing the Euro sign e in the cell 0xA4 of the former (unspecified) currency sign ¤. ¤ U+00A4 currency sign ! e U+20AC euro sign ¦ U+00A6 broken bar ! Sˇ U+0160 latin capital letter s with caron ¨ U+00A8 diaeresis ! ˇs U+0161 latin small letter s with caron ´ U+00B4 acute accent ! Zˇ U+017D latin capital letter z with caron ¸ U+00B8 cedilla ! ˇz U+017E latin small letter z with caron ¼ U+00BC vulgar frac 1 quarter ! Œ U+0152 latin capital ligature oe ½ U+00BD vulgar fraction 1 half ! œ U+0153 latin small ligature oe ¾ U+00BE vulgar frac 3 quarters! Y¨ U+0178 latin capital letter y with diaeresis Top half of Latin-0 XXX XXX BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 DCS PU1 PU2 SST CCH MS SPA EPA SOS XXX SCI CSI ST OSC PM APC ¡ ¢ £ e ¥ Sˇ § ˇs © ª « : SHY ® ¯ ° ± 2 3 Zˇ µ ¶ · ˇz 1 º » Œœ Y¨¿ A` A´ A^ A~ A¨ A˚ Æ C¸ E` E´ E^ E¨ I` I´ I^ I¨ Ð N~ O` O´ O^ O~ O¨ × Ø U` U´ U^ U¨ Y´ Þ ß `a ´a ^a ~a ¨a ˚a æ ¸c `e ´e ^e ¨e `ı ´ı ^ı ¨ı đ ~n `o ´o ^o ~o ¨o ÷ ø `u ´u ^u ¨u ´y þ ¨y Differences in Character Encodings binary oct dec hex MacR 1252 Latin1 Latin0 0111 0011 0163 115 0x73 s s s s 1000 0000 0200 128 0x80 A¨ e XXX XXX 1000 0101 0205 135 0x85 O¨ ::: NEL NEL 1000 1010 0212 138 0x8A ¨a Sˇ VTS VTS 1010 0100 0244 164 0xA4 § ¤ ¤ e 1010 0110 0246 166 0xA6 ¶ ¦ ¦ Sˇ 1011 0110 0266 182 0xB6 @ ¶¶¶ 1101 1011 0333 219 0xDB e U^ U^ U^ 1110 0100 0344 228 0xE4 ‰ ¨a ¨a ¨a 1111 0011 0363 243 0xF3 U^ ´o ´o ´o Standards help insure that the bit patterns are understood the same way. But the applicable standard must be clearly known. Java Makes It Easy Indicate to the Scanner class which character encoding is to be expected, and Java will interpret the bytes correctly. This is because Java uses Unicode internally which is a superset of all commonly used character set encodings. Scanner s = new Scanner (System.in,"LATIN-1"); Scanner s = new Scanner (System.in,"Cp1252"); Without a specified character encoding, the computer's default encoding is used. Scanner s = new Scanner (System.in); A program with such a scanner may behave differently on different computers, leading to confusion. TIP All the input in lab will be US-ASCII, so please use the two argument form of the Scanner class at all times. Scanner s = new Scanner (System.in,"US-ASCII"); A bad bit pattern in the input (there won't be one) would result in a Unicode character for \bad character" { 0xFFFD replacement character { in your input. The behavior of the programs for the exercises in lab is never defined on any \bad" input whatsoever The program may do anything at all including loop or result in a runtime error. In particular the program does not have to detect \bad" input or report it. What Do Characters Mean? Some characters, like the tab, have no fixed meaning, even though it has an agreed upon code point. Tabs are interpreted differently by different applications leading to confusion. Six invisible or white-space characters are legal in a Java program. No other control characters are legal in a Java program. A Java program is permitted to end with the \substitute" character. binary oct dec Latin1 Unicode 0000 1001 0011 9 HT U+0009 horizontal tabulation 0000 1010 0012 10 LF U+000A line feed 0000 1100 0014 12 FF U+000C form feed 0000 1101 0015 13 CR U+000D carriage return 0001 1010 0032 26 SUB U+001A substitute 0010 0000 0040 32 U+0020 space There is no advantage to using a horizontal tabulation or a substitute character in a Java program. But there is a risk of breaking some application that uses Java source code for input (pretty-printers, text beautifiers, metric tools, etc.) Newlines indications are necessary for formatting programs, and Java permits all three of the common newline conventions: the line feed character (common in Unix applications), the carriage return (Mac applications), and the carriage return character followed by the line feed (Microsoft applications). MacOS CR "\r" Unix LF "\n" Windows CR,LF "\r\n" Other newline markers are much less common. Next-line (NEL x85) is used to mark end-of-line on some IBM mainframes. Unicode has its own approach to indicating a new line: Unicode U+2028 line separator The method readLine() of the Java class BufferedReader tolerates MacOS, Unix or Window (but not Unicode) new-line indications. The Java JVM initializes the new-line indication according to the platform it is running. This value is available through system properties: System.getProperty("line.separator"). The method format in class Formatter recognizes %n in format strings as the value of the line separator. // Use Unicode line separator character System.setProperty("line.separator","\u2028"); System.out.format("A line.%n"); Newline From Wikipedia: In computing a newline, also known as a line break or end-of-line (EOL) marker, is a special character or sequence of characters signifying the end of a line of text. There is also some confusion whether newlines terminate or separate lines. If a newline is considered a separator, there will be no newline after the last line of a file. The general convention on most systems is to add a newline even after the last line, i.e. to treat newline as a line terminator. Some programs have problems processing the last line of a file if it is not newline terminated. Newline Please consider the newline mark a line terminator. The number of newline marks in a file is the number of lines in the file. You can lose points on tests through this type of miscommunication. Unicode Versions version date scripts characters bits 1.0 Oct 1991 24 7,161 12.81 2.0 Jul 1996 24 38,950 15.25 3.0 Sep 1999 38 49,249 15.59 4.0 Apr 2003 52 96,447 16.56 5.0 Jul 2006 64 99,098 16.60 6.0 Oct 2010 93 109,449 16.74 7.0 Jun 2014 123 113,021 16.79 8.0 Jun 2015 123 120,737 16.88 9.0 Jun 2016 135 128,237 16.97 10.0 Jun 2017 139 136,690 17.06 11.0 Jun 2018 146 137,374 17.07 Here characters mean the number of encoded characters: graphic characters, format characters plus control characters.