Character Encoding - the Transformation Between Code Point and Its Own Code Units


The ABC's of Emoji
stuartlau@github, Sep 2018
https://stuartlau.github.io

Quiz
• Which language can use US-ASCII to encode all its characters?
• How many characters can char represent in Java?
• Can we use char to represent ‘[emoji]’ in Java?
• What is returned if you call “[emoji]”.length() and “[emoji]”.getBytes() in Java?
• What about “[another emoji]”?
• Can we get the emoji by calling “[emoji]c”.substring(0,1)?
• Can we execute insert into tb(‘name’) values(‘[emoji]’) in MySQL?

Character

How to Define a Character?
• Character set - defines all readable characters
• Coded character set - assigns a code point to represent each character in the character repertoire
• Character encoding - the transformation between a code point and its own code units

Why do we need character encoding?

Code Point
• A unique number assigned to each Unicode character
• Usually expressed in hexadecimal as U+xxxx, e.g. the code point for A is U+0041

Planes
• 17 planes
• 136755 characters defined
• U+0000~U+10FFFF, 21-bit
• Supports over 1.1M possible characters

BMP
• Basic Multilingual Plane, U+0000~U+FFFF, 65536 code points in total

Supplementary Characters
• Code points between U+10000 and U+10FFFF are the supplementary characters
• They cannot be described as a single 16-bit entity

Character Encoding
• A mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units
• The most commonly used code units are bytes, but 16-bit and 32-bit integers can also be used for internal processing
• UTF-32, UTF-16 and UTF-8 are character encoding schemes for the Unicode standard

UTF-32
• UTF-32 encodes each Unicode character as one 32-bit code unit, e.g. A is 00 00 00 41
• It's the most convenient representation for internal processing
• But it wastes memory

UTF-16
• UTF-16 encodes each Unicode character as one or two 16-bit code units, each holding a value in 0x0000~0xFFFF (0~65535)
• Each character is encoded using 2 or 4 bytes
• The internal Java encoding
• Code points between U+0000 and U+FFFF are represented as a 16-bit Java char value
• e.g. U+4E2D - 中, 2 bytes, char c = ‘中’
• Code points between U+10000 and U+10FFFF are the supplementary characters, which a Java char cannot hold

Hello in UTF-16
H     e     l     l     o
00 48 00 65 00 6C 00 6C 00 6F   (big-endian)
48 00 65 00 6C 00 6C 00 6F 00   (little-endian)
Endianness

BOM
• BOM = Byte Order Mark, the code point U+FEFF
• Appears at the start of Unicode text
• Big-endian UTF-16 text starts with the bytes FE FF
• Little-endian UTF-16 text starts with the bytes FF FE
• UTF-16 and UTF-32 have to deal with the issue of BE and LE because they use multi-byte code units

Java and Supplementary Characters
• Unicode was originally designed as a fixed-width 16-bit character encoding
• Java originally held every Unicode character in a single char
• Unicode was later extended to 1,114,112 code points (21 bits), and Unicode 3.1 was the first version to assign supplementary characters
• J2SE 5.0 supports version 4.0 of the Unicode standard

UTF-8
• 8-bit, variable-width encoding
• Encodes each Unicode character using 1 to 4 bytes
• String constants in .class files are encoded using (modified) UTF-8
• No BOM needed

UTF-8 Encoding
Byte1     Byte2     Byte3     Byte4     Available Bits  Sample
0xxxxxxx                                7               abc
110xxxxx  10xxxxxx                      11              āō
1110xxxx  10xxxxxx  10xxxxxx            16              汉字
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  21              [emoji]

UTF-8 to UTF-16
• For a byte sequence whose first byte:
  • starts with 0, e.g. 0xxxxxxx ==> 00000000 0xxxxxxx
  • starts with 110, e.g. 110xxxxx 10yyyyyy ==> 00000xxx xxyyyyyy
  • starts with 1110, e.g. 1110xxxx 10yyyyyy 10zzzzzz ==> xxxxyyyy yyzzzzzz
• e.g. “中”
  • Unicode U+4E2D: 01001110 00101101
  • UTF-8 E4 B8 AD: 11100100 10111000 10101101

UTF-16 to UTF-8
• For a 16-bit code unit:
  • Up to 0x007F (00000000 01111111), e.g. 00000000 0xxxxxxx ==> 0xxxxxxx
  • Up to 0x07FF (00000111 11111111), e.g. 00000aaa bbbbbbbb ==> 110aaabb 10bbbbbb
  • Others (BMP, non-surrogate), e.g. aaaaaaaa bbbbbbbb ==> 1110aaaa 10aaaabb 10bbbbbb
• e.g. “中”
  • Unicode U+4E2D: 01001110 00101101
  • UTF-8 E4 B8 AD: 11100100 10111000 10101101

Emoji

History
• In 1999, Shigetaka Kurita created the first collection of 180 emoji for a Japanese mobile web platform
• Oxford Dictionaries named the “face with tears of joy” emoji its Word of the Year for 2015

ABC
• Pronounced /ɪˈmoʊdʒi/, from Japanese
• “e” (picture), “moji” (character)
• As of 2017 there were 2,666 emoji on the official Unicode Standard list, spread across 22 blocks

Design Guideline

Text & Colorful Shape
• An emoji character can have two main kinds of presentation:
  • An emoji presentation, with colorful and perhaps whimsical shapes, even animated
  • A text presentation, such as black & white

Diversity
• Emoji display varies across operating systems, apps, versions, etc.
• Even in the same app you may get a different display

Vendor Implementations

Skin tone
• What's the difference?
  • U+1F64E
  • U+1F64E U+1F3FB
  • U+1F64E U+1F3FF U+200D U+2640 U+FE0F

What's the
• U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
• U+200D ZERO WIDTH JOINER
• U+2640 FEMALE SIGN
• U+FE0F VARIATION SELECTOR-16

Emoji Modifiers
• Emoji modifier - A character that can be used to modify the appearance of a preceding emoji in an emoji modifier sequence
• Emoji modifier base - A character whose appearance can be modified by a subsequent emoji modifier in an emoji modifier sequence
• Emoji modifier sequence - A sequence of the following form:
  emoji_modifier_sequence := emoji_modifier_base emoji_modifier

Fitzpatrick Modifiers
• When one of these characters follows certain characters, a font should show the sequence as a single glyph with the specified skin tone
• If the font doesn't show the combined character, the user can still see that a skin tone was intended

Sample Use of Fitzpatrick Modifiers

Variation Selectors
• VS is a Unicode block containing 16 Variation Selector format characters (designated VS1 through VS16)
• They are used to specify a specific glyph variant for a Unicode character
• At present only standardized variation sequences with VS1, VS15 and VS16 have been defined

VS-15 (U+FE0E)
• An invisible code point which specifies that the preceding character should be rendered in a textual fashion

VS-16 (U+FE0F)
• An invisible code point which specifies that the preceding character should be displayed with emoji presentation
• Only required if the preceding character defaults to text presentation
• Often used in emoji ZWJ sequences, where one or more characters in the sequence have both text and emoji presentation

ZWJ

Emoji ZWJ Sequences
• ZERO WIDTH JOINER, U+200D
• Joins characters into a single glyph if available
• The sequence behaves like a single emoji character, even though internally it is a sequence

Example
• The sequence U+1F468 👨, U+200D ZWJ, U+1F469 👩, U+200D ZWJ, U+1F467 👧
• could be displayed as a single emoji depicting a family 👨‍👩‍👧 if the implementation supports it
• or else the system would ignore the ZWJs and show the base emoji in the sequence: 👨👩👧

Multi-Person Groupings

Gender Combinations
• Some multi-person groupings explicitly indicate gender: MAN AND WOMAN HOLDING HANDS
• Others do not: KISS, COUPLE WITH HEART

Practice

“[emoji]”.length?
• Actually composed of 4 emoji, 1 variation selector and 3 ZWJs

Example
“👨”.codePointAt(0) // 128104 -> U+1F468, the combined code point of the surrogate pair
“👨”.codePointAt(1) // 56424 -> U+DC68, the trailing surrogate on its own
• The man emoji has the code point U+1F468
• It can't be represented in a single code unit in Java
• That's why a surrogate pair has to be used, so it consists of two code units

How does Java represent supplementary characters?

Surrogate Pair
• It is possible to combine two code points defined in the BMP to express another code point that lies outside of the first 65536 code points. This combination is called a surrogate pair.
• Leading Surrogate: U+D800~U+DBFF
• Trailing Surrogate: U+DC00~U+DFFF
• The values from U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points

Deep Dive
• e.g. U+1F468 👨
• 0x1F468 - 0x10000 = 0xF468
• => 11110100 01101000
• Using 20 bits => 0000111101 0001101000
• 0xD800 + 0x3D = 0xD83D
• 0xDC00 + 0x68 = 0xDC68

Supplementary Encoding in UTF-16
• UTF-16 covers U+0000~U+FFFF using 2 bytes
• For a code point U in U+10000~U+10FFFF:
  • Subtract 0x10000 to get U’ (0x00000~0xFFFFF), 20 bits, e.g. U’ = yyyyyyyyyyxxxxxxxxxx
  • Use W1 to represent the first 10 bits, e.g. W1 = 110110yyyyyyyyyy, W1 in D800~DBFF
  • Use W2 to represent the second 10 bits, e.g. W2 = 110111xxxxxxxxxx, W2 in DC00~DFFF

Tricks 1 - native2ascii

Tricks 2 - Calculator

Tricks 3 - Character Viewer

Java API
• String.codePointAt(int index):int
• Character.highSurrogate(int codePoint):char
• Character.lowSurrogate(int codePoint):char
• Character.charCount(int codePoint):int
• Character.isSupplementaryCodePoint(int codePoint):boolean
• Character.isSurrogate(char):boolean
• Character.isSurrogatePair(char, char):boolean
• …

References
• https://en.wikipedia.org/wiki/Emoji
• http://stn.audible.com/abcs-of-unicode
• https://twitter.github.io/twemoji/preview.html
• http://www.unicode.org/reports/tr51/
• https://en.wikipedia.org/wiki/Fitzpatrick_scale
• http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
• https://en.wikipedia.org/wiki/UTF-8
• https://vinoit.me/2016/10/07/codePoint-in-java-and-utf16/
• https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
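The surrogate-pair arithmetic from the Deep Dive and Supplementary Encoding slides can be sketched in Java roughly as follows. This is a minimal illustration, not code from the deck: the class name SurrogateSketch and its two methods are hypothetical, and in practice the JDK methods listed on the Java API slide (Character.highSurrogate, Character.lowSurrogate) do the same job.

```java
/** Illustrative sketch of the UTF-16 surrogate arithmetic; all names here are hypothetical. */
final class SurrogateSketch {

    /** Splits a supplementary code point (U+10000..U+10FFFF) into {W1, W2}. */
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int u = codePoint - 0x10000;              // U' fits in 20 bits
        char w1 = (char) (0xD800 + (u >>> 10));   // leading surrogate: top 10 bits
        char w2 = (char) (0xDC00 + (u & 0x3FF));  // trailing surrogate: low 10 bits
        return new char[] {w1, w2};
    }

    /** Reverses the split, assuming w1 and w2 form a valid surrogate pair. */
    static int fromSurrogates(char w1, char w2) {
        return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F468);
        // Prints "D83D DC68", matching the Deep Dive slide.
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Prints "U+1F468".
        System.out.printf("U+%X%n", fromSurrogates(pair[0], pair[1]));
        // The JDK helper agrees with the hand-written version.
        System.out.println(pair[0] == Character.highSurrogate(0x1F468));
    }
}
```

Writing the shifts out by hand is only useful for seeing where 0xD83D and 0xDC68 come from; production code would normally call Character.toChars or Character.toCodePoint instead.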
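The bit layouts in the UTF-8 Encoding table can likewise be turned into a small encoder. The sketch below is an assumption-laden illustration (the class Utf8Sketch and its methods are made-up names); it encodes one code point at a time, whereas real code would simply call String.getBytes(StandardCharsets.UTF_8).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch of UTF-8 encoding for a single code point; names are hypothetical. */
final class Utf8Sketch {

    /** Encodes one Unicode scalar value following the 1 to 4 byte layouts in the table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x4E2D)));   // E4 B8 AD
        System.out.println(hex(encode(0x1F468)));  // F0 9F 91 A8
        // Cross-check against the JDK encoder.
        byte[] jdk = new String(Character.toChars(0x1F468)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(jdk, encode(0x1F468)));
    }
}
```

Note that supplementary characters are encoded from the code point itself (4 bytes), not from the two surrogates separately.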
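Several of the quiz questions (length(), getBytes(), substring(0,1)) and the ZWJ and skin-tone slides come down to the difference between code units, code points and bytes. The following sketch tries to make that concrete; the class EmojiLengthSketch and its helper names are hypothetical, and the strings are written with escapes so the source file does not depend on its own encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

/** Illustrative sketch: code units vs. code points vs. UTF-8 bytes; names are hypothetical. */
final class EmojiLengthSketch {

    /** U+1F468 MAN as a surrogate pair. */
    static final String MAN = "\uD83D\uDC68";
    /** The family sequence from the Example slide: U+1F468 ZWJ U+1F469 ZWJ U+1F467. */
    static final String FAMILY = MAN + "\u200D" + "\uD83D\uDC69" + "\u200D" + "\uD83D\uDC67";
    /** The longest sequence from the Skin tone slide, built from its code points. */
    static final String SKIN_TONE =
        new String(new int[] {0x1F64E, 0x1F3FF, 0x200D, 0x2640, 0xFE0F}, 0, 5);

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Returns the first code point as a String without splitting a surrogate pair (s must be non-empty). */
    static String firstCodePoint(String s) {
        return s.substring(0, s.offsetByCodePoints(0, 1));
    }

    /** Lists the code points of s in U+XXXX notation. */
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(MAN.length());        // 2 code units
        System.out.println(codePoints(MAN));     // 1 code point
        System.out.println(utf8Bytes(MAN));      // 4 bytes
        System.out.println(FAMILY.length());     // 8 code units
        System.out.println(codePoints(FAMILY));  // 5 code points
        System.out.println(describe(SKIN_TONE)); // U+1F64E U+1F3FF U+200D U+2640 U+FE0F
        // substring(0, 1) cuts the pair in half and leaves a lone leading surrogate.
        System.out.println(Character.isSurrogate((MAN + "c").substring(0, 1).charAt(0)));
        System.out.println(firstCodePoint(MAN + "c").equals(MAN));
    }
}
```

Even firstCodePoint only protects surrogate pairs; a ZWJ or modifier sequence still spans several code points, so cutting it anywhere inside would break the single displayed emoji.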