Unicode Explained Ebook Free Download


Unicode Explained Ebook Free Download UNICODE EXPLAINED PDF, EPUB, EBOOK Jukka K. Korpela | 800 pages | 01 Jun 2006 | O'Reilly Media, Inc, USA | 9780596101213 | English | Sebastopol, United States

Unicode Explained PDF Book

Book description: Fundamentally, computers just deal with numbers. The first few chapters provide a tutorial presentation of Unicode and character data, and the middle section offers more detailed information about using Unicode and other character codes.

For a while, a given character code on my machine meant the same character as on yours. In the end, though, other parts of the world began creating their own encoding schemes, and things started to get a little bit confusing. The objective of Unicode is to unify all the different encoding schemes so that the confusion between computers can be limited as much as possible. The Unicode standard defines such a code by means of character encoding; you are not supposed to hand-code the processing of tens of thousands of different characters. The majority of common languages fit into the first 65,536 code points, which can be stored in 2 bytes, and when all was done, the Unicode standard left room for over 1 million code points: enough for all known languages, with room to spare for undiscovered civilizations.

In UTF-8, ASCII text is stored identically and efficiently, so UTF-8 saves space. This design was necessary: ASCII was a standard, and if Unicode was to be adopted by the Western world, it needed to be compatible, without question. UTF-8 also reserves a number of byte values that can never occur in valid text, which makes mislabeled data easier to detect. Storing data in multiple bytes leads to my favorite conundrum: byte order! (A Stack Overflow answer does a good job of explaining what a code point is.)

Not every program needs the full repertoire. If you have Modern Greek text, for example, you can represent it in some 8-bit encoding, using just one octet per character; some non-Unicode encodings are rather efficient for such text, since they have been optimized for that use. I have a request from a developer concerning whether Tcl is capable of handling characters larger than the Unicode BMP. For string indexing, the most obvious optimization is to discard the indexing array if the byte count and the character count turn out to be equal.

Unicode has been criticized on several accounts, from very different perspectives. One could say that Unicode was once open to the inclusion of precomposed characters as needed, but was then closed after all "important" languages had been covered. With unified characters, you would have to use methods above the character level to have them display differently, and this would be too clumsy for many purposes. Some names for characters in scripts that are not well known in the Western world are just wrong: a name might be one that is commonly used to refer to a character in the script, but to a different character. Each is nonetheless a character with well-defined semantics, quite independent of its name. For a really universal and unambiguous notation for control characters, I think we would need something markup-like, such as using "Ctrl x" to indicate typing x while the Ctrl key is held down.
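To make the ASCII-compatibility and space points concrete, here is a minimal Python sketch (the snippet and its sample strings are ours, not the book's):

    # ASCII text is byte-identical and one byte per character in UTF-8.
    text = "Hello"
    print(text.encode("ascii"))    # b'Hello'
    print(text.encode("utf-8"))    # b'Hello' -- the same five bytes

    # Non-ASCII characters use multi-byte sequences instead.
    mixed = "Héllo"                                  # 'é' is U+00E9
    print(mixed.encode("utf-8"))                     # b'H\xc3\xa9llo'
    print(len(mixed), len(mixed.encode("utf-8")))    # 5 characters, 6 bytes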
The history of character codes is largely a story of extensions, starting from a very limited set of characters that were suitable for some technical needs and for coarse writing of English. Not only were the coding schemes of different lengths, but programs also needed to figure out which encoding scheme they were supposed to use. For a little while, depending on where you were, a different character might have been displayed for the same code. Someone finally got fed up with seeing gobbledygook in their documents and decided to create Unicode to unify all these encodings.

Unicode Explained Writer

If the whole computer industry uses the same character encoding scheme, every computer can display the same characters. Unicode defines the way individual characters are represented in text files, web pages, and other types of documents, and it enables a single software product or website to be targeted across multiple platforms, languages, and countries without re-engineering. Upon reading a character outside the subset it supports, a program may indicate, in some suitable way (such as showing a question mark in a box), its inability to display the character. In some situations, such as email, you can get Unicode characters through if you use an attachment or the HTML format.

One idea has many possible encodings. This document is written in UTF-8, for example. Consider UTF-16 as another example: in UTF-16, all commonly used characters take two octets. Is space or processing power more important when reading XML documents? Option 2 for the byte-order problem: everyone agrees to a byte order mark (BOM), a header at the top of each file. Since Java SE 5.0, the core Java API has also provided methods that work with whole code points rather than 16-bit units.

Originally, Unicode was squeezed into 16 bits at the cost of omitting a large number of less important CJK characters and "unifying" different characters into one; this often involved changes such as simplification in the shapes of characters. The issue of Han unification is not, however, a case of East Asian peoples against the Western world, and the Unicode policy in this issue is understandable. Unification makes it easier for people to recognize which character they wish to use, when they need not look for tiny differences. Excessive unification: even so, the unification principles and practices have raised many objections. Although the Unicode databases specify many properties of characters, there is no single and uniform source of information on their identity and intended usage. Character names raise similar complaints; the word "virgule" is rare, but "shilling" is worse.

Unicode is an interesting study. Normalization of code-point sequences is one issue. Trouble with combining characters is that the cursor can go between them, along with similar problems like cutting a string in the middle of what is called a grapheme cluster. Back to string indexing: when the remaining count is zero, it is known that the remaining characters are one byte each, so you can jump straight to the character. Another potential optimization is to scan backwards from the start of the next segment (or the end of the string) if the character index modulo n is greater than some threshold.

(One errata note: a paragraph in the book says that the ISO standard has not been put onto the Web.)
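As a sketch of how the BOM option works in practice (Python standard library only; the example is ours):

    import codecs

    # The utf-16 codec prepends a byte order mark (U+FEFF); a reader can
    # inspect the first two bytes to learn the byte order.
    data = "hi".encode("utf-16")   # BOM + text in this machine's byte order
    if data.startswith(codecs.BOM_UTF16_LE):
        print("little-endian:", data)   # e.g. b'\xff\xfeh\x00i\x00'
    elif data.startswith(codecs.BOM_UTF16_BE):
        print("big-endian:", data)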
There is processing to be done on every Unicode character, but this is a reasonable tradeoff.

Why 7 bits? ASCII was created to help with this and is essentially a lookup table of bytes to characters; such tables map the numeric values to various Western characters and control codes (newline, tab, etc.). In UTF-8, bytes 0xF8 and greater are illegal. UTF-8 is pretty clever, eh? Byte-order conversion, by contrast, involves swapping bytes throughout the file.

Unicode is a standard for coding multilingual text. It has several character encoding forms: UTF-8, UTF-16, and UTF-32. In reality, you can process Unicode data without making your application fully Unicode-conforming, and a program that supports Unicode may well support only a subset of Unicode characters. At each step of the standard's evolution, care was taken to guarantee efficient processing of already encoded characters, thereby often making the processing of new characters less efficient.

Each plane holds 65,536 code points. Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. As of ES2015, JavaScript strings likewise gained code-point-aware methods (such as String.fromCodePoint). There is an international standard, ISO/IEC 14755, that specifies a method for typing a character by its code number in a rather similar way, though using a different specific technique. (The exact character count keeps moving; at one point the number changed due to the addition of 4 Sindhi characters.)

In programming, unification might seem to make things simpler, since there are fewer different characters to be considered. As a consequence of unification, though, the properties of a character cannot be very descriptive, since they need to take both uses into account. I imagine most programs will want grapheme clusters to be atomic; this is an open issue in Unicode.

Unicode Explained Reviews

The book's topics include characters as units of text, characters as abstractions, and variation of appearance versus different characters. A mixture of fonts was used (see the Colophon), causing typographic misfits at times. For generalities, see also my blog entry "The Paradox of Unicode Adoption": Unicode works in casual memos but not in books.
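A short Python sketch of planes and grapheme clusters, using nothing beyond the standard library (the example strings are ours):

    import unicodedata

    # Code points are grouped into planes of 65,536 each; emoji live in plane 1.
    ch = "\U0001F600"                         # grinning face
    print(hex(ord(ch)), ord(ch) // 0x10000)   # 0x1f600 1
    print(len(ch.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (surrogate pair)

    # One user-perceived character can be several code points, which is why
    # normalization and grapheme clusters matter.
    precomposed = "\u00e9"      # é as a single code point
    combining = "e\u0301"       # e + combining acute accent
    print(precomposed == combining)                                # False
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True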
Recommended publications
  • Base64 Character Encoding and Decoding Modeling
    Base64 Character Encoding and Decoding Modeling Isnar Sumartono1, Andysah Putera Utama Siahaan2, Arpan3 Faculty of Computer Science, Universitas Pembangunan Panca Budi Jl. Jend. Gatot Subroto Km. 4,5 Sei Sikambing, 20122, Medan, Sumatera Utara, Indonesia Abstract: Security is crucial to maintaining the confidentiality of information. Secure information is information that should not be known to unreliable persons, especially information concerning the state and the government. This information is often transmitted over a public network. If the data is not secured in advance, it can easily be intercepted and the contents of the information learned by the people who stole it. The method used to secure data is a cryptographic system that changes plaintext into ciphertext. The Base64 algorithm is one of the encryption processes that is well suited for use in data transmission. The ciphertext obtained is an arrangement of characters that have been tabulated; these tables have been designed to facilitate the delivery of data during transmission. By applying this algorithm, errors can be avoided and security ensured. Keywords: Base64, Security, Cryptography, Encoding I. INTRODUCTION Security and confidentiality are important aspects of an information system [9][10]. The information sent is expected to be received only by those who have the right to it. Information will be useless if, at the time of transmission, it is intercepted or hijacked by an unauthorized person [7]. The public network is one that is prone to interception or hijacking [1][2]. From time to time, data transmission technology has developed rapidly. Security is necessary for an organization or company in order to maintain the integrity of its data and information.
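    A minimal illustration of the encode/decode round trip the abstract describes (Python; the sample message is ours, and note that Base64 by itself is an encoding, not secrecy):

        import base64

        plaintext = b"Attack at dawn"
        encoded = base64.b64encode(plaintext)   # the tabulated-character form
        print(encoded)                          # b'QXR0YWNrIGF0IGRhd24='
        print(base64.b64decode(encoded))        # b'Attack at dawn'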
  • Unicode Ate My Brain
    UNICODE ATE MY BRAIN John Cowan Reuters Health Information Copyright • Copyright © 2001 John Cowan • Licensed under the GNU General Public License • ABSOLUTELY NO WARRANTIES; USE AT YOUR OWN RISK • Portions written by Tim Bray; used by permission • Title devised by Smarasderagd; used by permission • Black and white for readability Abstract Unicode, the universal character set, is one of the foundation technologies of XML. However, it is not as widely understood as it should be, because of the unavoidable complexity of handling all of the world's writing systems, even in a fairly uniform way. This tutorial will provide the basics about using Unicode and XML to save lots of money and achieve world domination at the same time. Roadmap • Brief introduction (4 slides) • Before Unicode (16 slides) • The Unicode Standard (25 slides) • Encodings (11 slides) • XML (10 slides) • The Programmer's View (27 slides) • Points to Remember (1 slide) How Many Different Characters? a A à á â ã ä å ā ă ą (followed by the same letter rendered in many different typefaces) How Computers Do Text • Characters in computer storage are represented by "small" numbers • The numbers use a small number of bits: from 6 (BCD) to 21 (Unicode) to 32 (wchar_t on some Unix boxes) • Design choices: – Which numbers encode which characters – How to pack the numbers into bytes Where Does XML Come In? • XML is a textual data format • XML software is required to handle all commercially important characters in the world; a promise to "handle XML" implies a promise to be international • Applications can do what they want; monolingual applications can mostly ignore internationalization $$$ £££ ¥¥¥ • Extra cost of building-in internationalization to a new computer application: about 20% (assuming XML and Unicode).
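    The "how to pack the numbers into bytes" design choice is exactly what UTF-8 pins down; here is a sketch of its packing rules in Python (ours, covering only the 1-3 byte cases):

        def utf8_bytes(cp: int) -> bytes:
            if cp < 0x80:                       # 0xxxxxxx
                return bytes([cp])
            if cp < 0x800:                      # 110xxxxx 10xxxxxx
                return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
            if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
                return bytes([0xE0 | (cp >> 12),
                              0x80 | ((cp >> 6) & 0x3F),
                              0x80 | (cp & 0x3F)])
            raise ValueError("code point above U+FFFF needs the 4-byte form")

        assert utf8_bytes(ord("A")) == "A".encode("utf-8")     # one byte
        assert utf8_bytes(ord("ü")) == "ü".encode("utf-8")     # two bytes
        assert utf8_bytes(ord("你")) == "你".encode("utf-8")   # three bytes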
  • A Decision Procedure for String to Code Point Conversion
    A Decision Procedure for String to Code Point Conversion Andrew Reynolds1, Andres Nötzli2, Clark Barrett2, and Cesare Tinelli1 1 Department of Computer Science, The University of Iowa, Iowa City, USA 2 Department of Computer Science, Stanford University, Stanford, USA Abstract. In text encoding standards such as Unicode, text strings are sequences of code points, each of which can be represented as a natural number. We present a decision procedure for a concatenation-free theory of strings that includes length and a conversion function from strings to integer code points. Furthermore, we show how many common string operations, such as conversions between lowercase and uppercase, can be naturally encoded using this conversion function. We describe our implementation of this approach in the SMT solver CVC4, which contains a high-performance string subsolver, and show that the use of a native procedure for code points significantly improves its performance with respect to other state-of-the-art string solvers. 1 Introduction String processing is an important part of many kinds of software. In particular, strings often serve as a common representation for the exchange of data at interfaces between different programs, between different programming languages, and between programs and users. At such interfaces, strings often represent values of types other than strings, and developers have to be careful to sanitize and parse those strings correctly. This is a challenging task, making the ability to automatically reason about such software and interfaces appealing. Applications of automated reasoning about strings include finding or proving the absence of SQL injections and XSS vulnerabilities in web applications [28, 25, 31], reasoning about access policies in cloud infrastructure [7], and generating database tables from SQL queries for unit testing [29].
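    A toy Python version of the conversion function and the case-conversion encoding the abstract mentions (the names str_to_code and ascii_upper are ours; the -1 convention for non-single-character strings follows the SMT-LIB str.to_code operator):

        def str_to_code(s: str) -> int:
            # Code point of a one-character string; -1 for any other length.
            return ord(s) if len(s) == 1 else -1

        def ascii_upper(s: str) -> str:
            # Uppercasing expressed as arithmetic on code points (ASCII only).
            return "".join(
                chr(str_to_code(c) - 0x20) if "a" <= c <= "z" else c
                for c in s
            )

        print(str_to_code("A"))        # 65
        print(str_to_code("AB"))       # -1
        print(ascii_upper("hello"))    # HELLO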
  • Unicode and Code Page Support
    Natural for Mainframes Unicode and Code Page Support Version 4.2.6 for Mainframes October 2009 This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners. Table of contents: Unicode and Code Page Support; Introduction (About Code Pages and Unicode; About Unicode and Code Page Support in Natural; ICU on Mainframe Platforms); Unicode and Code Page Support in the Natural Programming Language (Natural Data Format U for Unicode-Based Data; Statements; Logical ...).
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
  • SAS 9.3 UTF-8 Encoding Support and Related Issue Troubleshooting
    SAS 9.3 UTF-8 Encoding Support and Related Issue Troubleshooting Jason (Jianduan) Liang SAS certified: Platform Administrator, Advanced Programmer for SAS 9 Agenda: Introduction; UTF-8 and other encodings; SAS options for encoding and configuration; Other considerations for UTF-8 data; Encoding issue troubleshooting techniques (tips). Introduction What is UTF-8? A character encoding capable of encoding all possible characters. Why UTF-8? Dominant encoding of the www (86.5%). SAS system options for encoding: Encoding – instructs SAS how to read, process and store data; Locale – instructs SAS how to present or display currency, date and time, and set timezone values. UTF-8 and other encodings ASCII (American Standard Code for Information Interchange): 7-bit, 128-character set. Examples (code point-char-hex): 32-Space-20; 63-?-3F; 64-@-40; 65-A-41. ISO 8859-1 (Latin-1) for Western European languages; Windows-1252 (Latin-1) for Western European languages: 8-bit (1 byte, 256-character set), identical to ASCII for the first 128 chars. Extended ASCII char examples: 163-£-A3; 169-©-A9. SAS option encoding value: wlatin1 (latin1). Problems: covers only English and Western European languages (ISO-8859-2 through -15 cover others); multiple encodings are required to support national languages; the same character is encoded differently, and the same code point represents different chars. Unicode: assign a unique code/number to every possible character of all languages. Examples of Unicode points: U+0020 – Space; U+0041 – A; U+00A9 – ©; U+00FF – ÿ.
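    The "same character encoded differently" problem in miniature, as a Python sketch of ours:

        pound = "\u00a3"                  # £
        print(pound.encode("latin-1"))    # b'\xa3'      -- one byte, code point 163
        print(pound.encode("utf-8"))      # b'\xc2\xa3'  -- two bytes
        # Reading UTF-8 bytes as Latin-1 yields mojibake instead of an error:
        print(pound.encode("utf-8").decode("latin-1"))   # 'Â£'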
  • JS Character Encodings
    JS Character Encodings Anna Henningsen · @addaleax · she/her It's good to be back! ??? https://travis-ci.org/node-ffi-napi/get-symbol-from-current-process-h/jobs/641550176 So … what's a character encoding? People are good with text, computers are good with numbers: Text → List of characters → "Encoding" → List of bytes / List of integers. Hello → ['H','e','l','l','o'] → 68 65 6c 6c 6f → [72, 101, 108, 108, 111] 你好! → ['你','好'] → ??? ??? ASCII: 0 0x00 <NUL> … 65 0x41 A; 66 0x42 B; 67 0x43 C … 97 0x61 a; 98 0x62 b … 127 0x7F <DEL> ASCII ● 7-bit ● Covers most English-language use cases ● … and that's pretty much it ISO-8859-*, Windows code pages ● Idea: Usually, transmission has 8 bits per byte available, so create ASCII-extending charsets for more languages. ISO-8859-1 (Western, aka Latin-1) / ISO-8859-5 (Cyrillic) / Windows-1251 (Cyrillic): 0xD0 → Ð / а / Р; 0xD1 → Ñ / б / С; 0xD2 → Ò / в / Т GBK ● Idea: Also extend ASCII, but use 2 bytes for Chinese characters: 0x41 A; 0x42 B … 0xC4 0xE3 你; 0xC4 0xE4 匿 … https://xkcd.com/927/ Unicode: Multiple encodings! "Müll" = U+004D M, U+00FC ü, U+006C l, U+006C l → 4d c3 bc 6c 6c (UTF-8); 4d 00 fc 00 6c 00 6c 00 (UTF-16LE); 00 4d 00 fc 00 6c 00 6c (UTF-16BE) Unicode ● New idea: Don't create a gazillion charsets, and drop the 1-byte/2-byte restriction ● Shared character set for multiple encodings: U+XXXX with 4 hex digits, e.g.
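    The slide's "Müll" example can be checked directly (a Python sketch of ours; the hex separator argument needs Python 3.8+):

        s = "Müll"
        print(s.encode("utf-8").hex(" "))       # 4d c3 bc 6c 6c
        print(s.encode("utf-16-le").hex(" "))   # 4d 00 fc 00 6c 00 6c 00
        print(s.encode("utf-16-be").hex(" "))   # 00 4d 00 fc 00 6c 00 6c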
  • San José, October 2, 2000 Feel Free to Distribute This Text
    San José, October 2, 2000 Feel free to distribute this text (version 1.2) including the author's email address ([email protected]) and to contact him for corrections and additions. Please do not take this text as a literal translation, but as a help to understand the standard GB 18030-2000. Insertions in brackets [] are used throughout the text to indicate corresponding sections of the published Chinese standard. Thanks to Markus Scherer (IBM) and Ken Lunde (Adobe Systems) for initial critical reviews of the text. SUMMARY, EXPLANATIONS, AND REMARKS: CHINESE NATIONAL STANDARD GB 18030-2000: INFORMATION TECHNOLOGY – CHINESE IDEOGRAMS CODED CHARACTER SET FOR INFORMATION INTERCHANGE – EXTENSION FOR THE BASIC SET (信息技术-信息交换用汉字编码字符集 Xinxi Jishu – Xinxi Jiaohuan Yong Hanzi Bianma Zifuji – Jibenji De Kuochong) March 17, 2000, was the publishing date of the Chinese national standard (国家标准 guojia biaozhun) GB 18030-2000 (hereafter: GBK2K). This standard tries to resolve issues resulting from the advent of Unicode, version 3.0. More specifically, it attempts the combination of Unicode's extended character repertoire, namely the Unihan Extension A, with the character coverage of earlier Chinese national standards. HISTORY The People's Republic of China had already expressed her fundamental consent to support the combined efforts of the ISO/IEC and the Unicode Consortium through publishing a Chinese National Standard that was code- and character-compatible with ISO 10646-1/Unicode 2.1. This standard was named GB 13000.1. Whenever the ISO and the Unicode Consortium changed or revised their "common" standard, GB 13000.1 adopted these changes subsequently. In order to remain compatible with GB 2312, however, which at the time of publishing Unicode/GB 13000.1 was an already existing national standard widely used to represent the Chinese "simplified" characters, the "specification" GBK was created.
  • Japanese Bibliographic Records and CJK Cataloging in U.S
    San Jose State University SJSU ScholarWorks Master's Theses Master's Theses and Graduate Research Fall 2009 Japanese bibliographic records and CJK cataloging in U.S. university libraries. Mie Onnagawa San Jose State University Follow this and additional works at: https://scholarworks.sjsu.edu/etd_theses Recommended Citation Onnagawa, Mie, "Japanese bibliographic records and CJK cataloging in U.S. university libraries." (2009). Master's Theses. 4010. DOI: https://doi.org/10.31979/etd.pcb8-mryq https://scholarworks.sjsu.edu/etd_theses/4010 This Thesis is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Theses by an authorized administrator of SJSU ScholarWorks. JAPANESE BIBLIOGRAPHIC RECORDS AND CJK CATALOGING IN U.S. UNIVERSITY LIBRARIES A Thesis Presented to The Faculty of the School of Library and Information Science San Jose State University In Partial Fulfillment of the Requirements for the Degree Master of Library and Information Science by Mie Onnagawa December 2009 UMI Number: 1484368 (UMI Dissertation Publishing; copyright 2010 by ProQuest LLC. This edition of the work is protected against unauthorized copying under Title 17, United States Code.)
  • AIX Globalization
    AIX Version 7.1 AIX globalization IBM Note: Before using this information and the product it supports, read the information in "Notices". This edition applies to AIX Version 7.1 and to all subsequent releases and modifications until otherwise indicated in new editions. © Copyright International Business Machines Corporation 2010, 2018. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents: About this document (Highlighting; Case-sensitivity in AIX; ISO 9000); AIX globalization (What's new; Separation of messages from programs; Conversion between code sets; ...).
  • Plain Text & Character Encoding
    Journal of eScience Librarianship Volume 10 Issue 3 Data Curation in Practice Article 12 2021-08-11 Plain Text & Character Encoding: A Primer for Data Curators Seth Erickson Pennsylvania State University Follow this and additional works at: https://escholarship.umassmed.edu/jeslib Part of the Scholarly Communication Commons, and the Scholarly Publishing Commons Repository Citation Erickson S. Plain Text & Character Encoding: A Primer for Data Curators. Journal of eScience Librarianship 2021;10(3): e1211. https://doi.org/10.7191/jeslib.2021.1211. Retrieved from https://escholarship.umassmed.edu/jeslib/vol10/iss3/12 Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 License. ISSN 2161-3974 JeSLIB 2021; 10(3): e1211 https://doi.org/10.7191/jeslib.2021.1211 Full-Length Paper Plain Text & Character Encoding: A Primer for Data Curators Seth Erickson The Pennsylvania State University, University Park, PA, USA Abstract Plain text data consists of a sequence of encoded characters or "code points" from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable, and they facilitate reuse and interoperability.
  • Fun with Unicode - an Overview About Unicode Dangers
    Fun with Unicode - an overview about Unicode dangers by Thomas Skora Overview ● Short introduction to Unicode/UTF-8 ● Fooling charset detection ● Ambiguous encoding ● Ambiguous characters ● Normalization overflows your buffer ● Casing breaks your XSS filter ● Unicode in domain names – how to shorten payloads ● Text direction Unicode/UTF-8 ● Unicode = character set ● Encodings: – UTF-8: common standard on the web, … – UTF-16: often used as internal representation – UTF-7: if the 8th bit is not safe – UTF-32: yes, it exists... UTF-8 ● Often used in Internet communication, e.g. the web. ● Efficient: minimum length 1 byte ● Variable length, up to 7 bytes (theoretical). ● Downwards-compatible: first 128 chars use ASCII encoding ● 1 byte: 0xxxxxxx ● 2 bytes: 110xxxxx 10xxxxxx ● 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx ● ...got it? ;-) UTF-16 ● Often used for internal representation: Java, .NET, Windows, … ● Inefficient: minimum length per char is 2 bytes. ● Byte order? Byte Order Mark! → U+FEFF – a BOM at the beginning of an HTML document overrides the character set definition in IE. ● Y\x00o\x00u\x00 \x00k\x00n\x00o\x00w\x00 \x00t\x00h\x00i\x00s\x00?\x00 UTF-7 ● Unicode chars in non-8-bit-safe environments. Used in SMTP, NNTP, … ● Personal opinion: browser support was an inside job of the security industry. ● Why? Because: <script>alert(1)</script> == +ADw-script+AD4-alert(1)+ADw-/script+AD4- ● Fortunately (for the defender) support has been dropped by browser vendors. Byte Order Mark ● U+FEFF ● Appears as:  ● W3C says: the BOM has priority over the declaration – IE 10+11 just dropped this insecure behavior; we should expect that it comes back. – http://www.w3.org/International/tests/html-css/character-encoding/results-basics#precedence – http://www.w3.org/International/questions/qa-byte-order-mark.en#bomhow ● If you control the first character of an HTML document, then you also control its character set.
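    The UTF-7 payload from the slide can be reproduced with Python's utf-7 codec (a sketch of ours; the expected output assumes CPython's encoder, which base64-encodes '<' and '>'):

        payload = "<script>alert(1)</script>"
        print(payload.encode("utf-7"))
        # b'+ADw-script+AD4-alert(1)+ADw-/script+AD4-'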