XML and Semantic Web Technologies II. XML / 1. Unicode, Uris, And

Total Page:16

File Type:pdf, Size:1020Kb

XML and Semantic Web Technologies II. XML / 1. Unicode, Uris, And XML and Semantic Web Technologies XML and Semantic Web Technologies II. XML / 1. Unicode, URIs, and XML Syntax Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Economics and Information Systems & Institute of Computer Science University of Hildesheim http://www.ismll.uni-hildesheim.de Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies II. XML / 1. Unicode, URIs, and XML Syntax 1. Unicode 2. Uniform Resource Identifiers (URIs) 3. XML Syntax Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies / 1. Unicode Semantic Web Layer Cake Figure 1: Semantic Web Layers (Berners-Lee). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies / 1. Unicode Coded Character Sets name codes examples ASCII code 0–127 64 A 7→ ISO-8859-1, ISO-LATIN-1 0–255 0–127 as ASCII, 196 7→ ISO-8859-7 0–255 0–127 as ASCII, 225 α 7→ Unicode 0–(232 1) 0–255 as ISO-8859-1 − Unicode is organized in 256 groups à 256 planes à 256 rows à 256 cells. Plane 0 (codes 0–65535) is called basis multilingual plane (BMP). Non ISO-8859-1 characters are mapped to higher codes, e.g., 945 α. 7→ Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 2/42 XML and Semantic Web Technologies / 1. Unicode Unicode Assigned characters of the Unicode standard (v6.0.0, 2011) can be found at http://www.unicode.org/charts/. Unicode also specifies character classes for each charac- ter, as letters (capital and small), • digits, • punctuation, • control characters. • Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 3/42 XML and Semantic Web Technologies / 1. Unicode Unicode / Scripts Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 4/42 XML and Semantic Web Technologies / 1. Unicode Unicode / Symbols Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 5/42 XML and Semantic Web Technologies / 1. Unicode Character Encoding Schemata coded character codes character byte characters character set (natural numbers) encoding schema sequences Character Encoding Schemata are trivial for 1-byte coded character sets. Direct representations of Unicode: UCS-2: direct representation of codes 0–65535 with 2 bytes. UCS-4: direct representation of all codes with 4 bytes. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 6/42 XML and Semantic Web Technologies / 1. Unicode Drawbacks of direct representations: bytecode 0x00 occurs (that marks string endings in • C), e.g., in UCS-4: A 65 (0, 0, 0, 65) 7→ 7→ uniform blow-up of storage space, but most texts • mostly use ASCII or ISO-8859-1. error-prone, as if one byte is lost, all following data will • be decoded incorrectly. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 7/42 XML and Semantic Web Technologies / 1. Unicode Unicode Transformation Formats (UTF) Unicode Transformation Formats (UTF) use a variable number of bytes for coding a character. UTF-8: 0x00-0x7f (bit sequences 0........) code ASCII characters directly, 0xc0-0xfd (bit sequences 11.......) mark the start of a multi-byte character repre- sentation (and code its length and leading bits of its code), 0x80-0xbf (bit sequences 10.......) code continuations of multi-byte character representations, 0xfe, 0xff (bit sequences 1111111.) are not used. bit sequence bytes free bits character codes 0....... 1 7 0x00– 0x7f 110..... 10...... 2 5 + 6 = 11 0x80– 0x7ff 1110.... 10...... 10...... 3 4 + 2 6 = 16 0x800– 0xffff · 11110... 10...... 10...... 10...... 4 3 + 3 6 = 21 0x10000– 0x1fffff · 111110.. 10...... 10...... 10...... 10...... 5 2 + 4 6 = 26 0x200000– 0x3ffffff · 1111110. 10...... 10...... 10...... 10...... 10...... 6 1 + 5 6 = 31 0x4000000–0xffffffff · Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 8/42 XML and Semantic Web Technologies II. XML / 1. Unicode, URIs, and XML Syntax 1. Unicode 2. Uniform Resource Identifiers (URIs) 3. XML Syntax Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 9/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Uniform Resource Identifiers (URIs) URIs are used to identify resources. Example: http://www.ismll.uni-hildesheim.de/lehre/xml-09s/index.html URIs are defined in RFC 3986 (01/2005). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 9/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Generic URI syntax /FE1!G/ IH "!$# % '&)(*",+-.!0/132*(4!05687*/ BD D ¦=<>?<> ¡ ¡ ¤ ¤ >Q ¢¡£¡¥¤§¦ ¡ M7ON P .!GO# N R L AWV SUT 9BAC: @ :- @KJ 9¢¨;: ¨ © | {z } | {z } | {z } | {z } | {z } :- 9BA-X:3V J*T | {z } ¨ © ZY[¨ F©FX\?© A£: @ @KJ | {z } ¤¦¢¨ ! ¡¦"¢$#&%(' £¢*)!+-,.¢/ ' 102%43 , ¡£¢¥¤§¦©¨ 56 879 ;:< =?>6 @:BADCFE £ | {z } | {z } Figure 6: Typical parts of URIs. Generic URI syntax: URI := scheme : scheme-specific-part h i h i h i Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 10/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Hierarchical URIs An URI is called hierarchical iff scheme-specific-part :=( // authority [ path ] h i h i h i | path )[ ? query ][ # fragment h i h i h i ] path :=( / path-segment )+ h i h i otherwise its called opaque. The path-segments . and .. have special meaning: context path and parent path. A hierarchical URI is called server-based iff authority :=[ userinfo @ ] host [ : port ] h i h i h i h i otherwise it is called registry-based. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 11/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Fragment identifiers Fragment identifiers are used to identify parts of the resource identified by an URI. Example: http://www.informatik.uni-freiburg.de/xml/books.html#R03 1 <html> 2 <body> 3 <li><a name="EE04">Rainer Eckstein, Silke Eckstein: 4 <em>XML und Datenmodellierung</em>, 2004.</a></li> 5 <li><a name="R03">Erik T. Ray: 6 <em>Learning XML</em>, 2003.</a></li> 7 </body> 8 </html> Figure 7: HTML document at http://www.informatik.uni-freiburg.de/xml/books.html. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 12/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Relative (hierarchical) URIs A relative URI is defined as: relativeURI ::=( // authority [ path ] h i h i h i | path h i | relativePath )[ ? query ] h i h i relativePath := path-segment ( / path-segment )* h i h i h i .----------------------------------------------------------. | .----------------------------------------------------. | | | .----------------------------------------------. | | | | | .----------------------------------------. | | | | | | | .----------------------------------. | | | | | | | | | <relative_reference> | | | | | | | | | ‘----------------------------------’ | | | | | | | | (5.1.1) Base URI embedded in the | | | | | | | | document’s content | | | | | | | ‘----------------------------------------’ | | | | | | (5.1.2) Base URI of the encapsulating entity | | | | | | (message, document, or none). | | | | | ‘----------------------------------------------’ | | | | (5.1.3) URI used to retrieve the entity | | | ‘----------------------------------------------------’ | | (5.1.4) Default Base URI is application-dependent | ‘----------------------------------------------------------’ Figure 8: A Base URI is the context for resolving relative URIs [RFC 2396]. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 13/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) URI schemes URI schemes are managed by Internet Assigned Numbers Authority (IANA). Scheme Name Description Reference Type ftp File Transfer Protocol RFC 1738 server-based http Hypertext Transfer Protocol RFC 2616 server-based mailto Electronic mail address RFC 2368 opaque
Recommended publications
  • Unicode Nearly Plain-Text Encoding of Mathematics Murray Sargent III Office Authoring Services, Microsoft Corporation 4-Apr-06
    Unicode Nearly Plain Text Encoding of Mathematics Unicode Nearly Plain-Text Encoding of Mathematics Murray Sargent III Office Authoring Services, Microsoft Corporation 4-Apr-06 1. Introduction ............................................................................................................ 2 2. Encoding Simple Math Expressions ...................................................................... 3 2.1 Fractions .......................................................................................................... 4 2.2 Subscripts and Superscripts........................................................................... 6 2.3 Use of the Blank (Space) Character ............................................................... 7 3. Encoding Other Math Expressions ........................................................................ 8 3.1 Delimiters ........................................................................................................ 8 3.2 Literal Operators ........................................................................................... 10 3.3 Prescripts and Above/Below Scripts........................................................... 11 3.4 n-ary Operators ............................................................................................. 11 3.5 Mathematical Functions ............................................................................... 12 3.6 Square Roots and Radicals ........................................................................... 13 3.7 Enclosures.....................................................................................................
    [Show full text]
  • Using Lexical Tools to Convert Unicode Characters to ASCII
    Using Lexical Tools to Convert Unicode Characters to ASCII Chris J. Lu, Ph.D.1, Allen C. Browne2, Divita Guy1 1Lockheed Martin/MSD, Bethesda, MD; 2National Library of Medicine, Bethesda, MD Abstract 2-3. Recursive algorithm Some Unicode characters require multiple steps, combining Unicode is an industry standard allowing computers to the methods described above, in ASCII conversion. A consistently represent and manipulate text expressed in recursive algorithm is implemented in LVG flow, -f:q7, for most of the world’s writing systems. It is widely used in this purpose. multilingual NLP (natural language processing) projects. On the other hand, there are some NLP projects still only 2-4. Table lookup mapping and strip dealing with ASCII characters. This paper describes Some Unicode characters do not belong to categories methods of utilizing lexical tools to convert Unicode mentioned above. A local table of conversion values is characters (UTF-8) to ASCII (7-bit) characters. used to convert these to an ASCII representation. For example, Greek letters are converted into fully spelled out 1. Introduction forms. ‘α’ is converted to “alpha”. Non-defined Unicode The SPECIALIST Lexical tools, 2008, distributed by characters, (such as ™, ©, ®, etc.), are treated as National Library of Medicine (NLM) provide several stopwords and stripped out completely to ensure pure functions, called LVG (Lexical Variant Generation) flow ASCII conversion. This table lookup and strip function is components, to convert Unicode characters to ASCII. In represented by LVG flow, -f:q8. general, ASCII conversion either preserves semantic and/or 2-5. Pure ASCII conversions graphic representation or facilitates NLP.
    [Show full text]
  • L2/20-136 Reference: Action: for Consideration by JTC1/SC2 Date: Apri 29, 2020
    Doc Type: Liaison Contribution Title: Comments on and proposed response to SC2 N4710, LS on Unicode symbol numbers representing disasters Status: Circulated for information Source: Unicode Consortium Source L2/20-136 reference: Action: For consideration by JTC1/SC2 Date: Apri 29, 2020 In N4710, 3GPP/CT1 requests action from JTC1/SC2 to respond to the following questions: • Question 1. 3GPP CT WG1 would like to ask ISO/IEC JTC1/SC2 to provide what Unicode symbols defined in ISO/IEC 10646 are possible to be recommended as language-independent contents such as pictograms mapping to disasters (earthquake, tsunami, fire, flood, typhoon, hurricane, cyclone, tornado, volcanic eruption, epidemic and chemical hazard) that can be compatible with Unicode-based texts and can be included in public warning message. • Question 2. If there are no proper Unicode symbols recommendable for the ePWS purpose to address the language issue existed in text-based warning messages in ISO/IEC 10646, 3GPP CT WG1 would like to request ISO/IEC JTC1/SC2 the standardization of Unicode-based language- independent contents representing earthquake, tsunami, fire, flood, typhoon, hurricane, cyclone, tornado, volcanic eruption, epidemic and chemical hazard. In short, this is a request for SC2 to devise a system of pictographic symbols for a particular application, and to include characters for these pictographs in the UCS (ISO/IEC 10646). The background given for this request is that 3GPP/CT1 is working on improvement to a text-based public warning service used in 5G communications (PWS) and would like to provide language-neutral symbols for public communication in disaster scenarios.
    [Show full text]
  • UTR #25: Unicode and Mathematics
    UTR #25: Unicode and Mathematics http://www.unicode.org/reports/tr25/tr25-5.html Technical Reports Draft Unicode Technical Report #25 UNICODE SUPPORT FOR MATHEMATICS Version 1.0 Authors Barbara Beeton ([email protected]), Asmus Freytag ([email protected]), Murray Sargent III ([email protected]) Date 2002-05-08 This Version http://www.unicode.org/unicode/reports/tr25/tr25-5.html Previous Version http://www.unicode.org/unicode/reports/tr25/tr25-4.html Latest Version http://www.unicode.org/unicode/reports/tr25 Tracking Number 5 Summary Starting with version 3.2, Unicode includes virtually all of the standard characters used in mathematics. This set supports a variety of math applications on computers, including document presentation languages like TeX, math markup languages like MathML, computer algebra languages like OpenMath, internal representations of mathematics in systems like Mathematica and MathCAD, computer programs, and plain text. This technical report describes the Unicode mathematics character groups and gives some of their default math properties. Status This document has been approved by the Unicode Technical Committee for public review as a Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress. Please send comments to the authors. A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
    [Show full text]
  • Proposal for Addition of Group Mark Symbol Introduction
    Proposal for addition of Group Mark symbol Ken Shirriff github.com/shirriff/groupmark Feb 14, 2015 Abstract The group mark symbol was part of the character set of important computers of the 1960s and 1970s, such as the IBM 705, 7070, 1401 and 1620. Books about these computers and manuals for them used the symbol to represent the group mark in running text. Unfortunately, this symbol does not exist in Unicode, making it difficult to write about technical details of these historical computers. Introduction The group mark was introduced in the 1950s as a separator character for I/O operations. In written text, the group mark is indicated by the symbol: . Unicode doesn't include this symbol, which is inconvenient when writing about the group mark or the character set of these computers. The group mark became part of IBM's Standard BCD Interchange Code (BCDIC) in 1962. [1, page 20 and figure 56]. This standard was used by the IBM 1401, 1440, 1410, 1460, 7010, 7040, and 7044 data processing systems [2]. The BCDIC standard provided consistent definitions of codes and the relation between these codes and printed symbols, including uniform graphics for publications.[2, page A-2]. Unicode can represent all the characters in BCDIC, except for the group mark. This proposal is for a Unicode code point for the group mark for use in running text with an associated glyph . This is orthogonal to the use of the group mark as a separator in data files. While the proposed group mark could be used in data files, that is not the focus of this proposal.
    [Show full text]
  • Unicode Nearly Plain Text Encoding of Mathematics
    Unicode Nearly Plain Text Encoding of Mathematics Unicode Nearly Plain-Text Encoding of Mathematics Murray Sargent III Office Authoring Services, Microsoft Corporation 28-Aug-06 1. Introduction ............................................................................................................ 2 2. Encoding Simple Math Expressions ...................................................................... 3 2.1 Fractions .......................................................................................................... 4 2.2 Subscripts and Superscripts........................................................................... 6 2.3 Use of the Blank (Space) Character ............................................................... 7 3. Encoding Other Math Expressions ........................................................................ 8 3.1 Delimiters ........................................................................................................ 8 3.2 Literal Operators ........................................................................................... 10 3.3 Prescripts and Above/Below Scripts .......................................................... 11 3.4 n-ary Operators ............................................................................................. 11 3.5 Mathematical Functions ............................................................................... 12 3.6 Square Roots and Radicals ........................................................................... 13 3.7 Enclosures ....................................................................................................
    [Show full text]
  • A Proposed Set of Mnemonic Symbolic Glyphs for the Visual Representation of C0 Controls and Other Nonprintable ASCII Characters Michael P
    Michael P. Frank My-ASCII-Glyphs-v2.3.doc 9/14/06 A Proposed Set of Mnemonic Symbolic Glyphs for the Visual Representation of C0 Controls and Other Nonprintable ASCII Characters Michael P. Frank FAMU-FSU College of Engineering Thursday, September 14, 2006 Abstract. In this document, we propose some single-character mnemonic, symbolic representations of the nonprintable ASCII characters based on Unicode characters that are present in widely-available fonts. The symbols chosen here are intended to be evocative of the originally intended meaning of the corresponding ASCII character, or, in some cases, a widespread de facto present meaning. In many cases, the control pictures that we suggest are similar to those specified in ANSI X3.32/ISO 2047, which has not been uniformly conformed to for representing the control codes in all of the widely available Unicode fonts. In some cases, we suggest alternative glyphs that we believe are more intuitive than ISO 2047’s choices. 1. Introduction The most current official standard specification defining the 7-bit ASCII character set is ANSI X3.4-1986, presently distributed as ANSI/INCITS 4-1986 (R2002). This character set includes 34 characters that have no visible, non-blank graphical representation, namely the 33 control characters (codes 00 16 -1F 16 and 7F 16 ) and the space character (code 20 16 ). In some situations, for readability and compactness, it may be desired to display or print concise graphic symbolic representations of these characters, consisting of only a single glyph per character, rather than a multi-character character sequence. According to ANSI X3.4-1986, graphic representations of control characters may be found in American National Standard Graphic Representation of the Control Characters of American National Standard Code for Information Interchange , ANSI X3.32-1973.
    [Show full text]
  • Unicode Characters in Proofpower Through Lualatex
    Unicode Characters in ProofPower through Lualatex Roger Bishop Jones Abstract This document serves to establish what characters render like in utf8 ProofPower documents prepared using lualatex. Created 2019 http://www.rbjones.com/rbjpub/pp/doc/t055.pdf © Roger Bishop Jones; Licenced under Gnu LGPL Contents 1 Prelude 2 2 Changes 2 2.1 Recent Changes .......................................... 2 2.2 Changes Under Consideration ................................... 2 2.3 Issues ............................................... 2 3 Introduction 3 4 Mathematical operators and symbols in Unicode 3 5 Dedicated blocks 3 5.1 Mathematical Operators block .................................. 3 5.2 Supplemental Mathematical Operators block ........................... 4 5.3 Mathematical Alphanumeric Symbols block ........................... 4 5.4 Letterlike Symbols block ..................................... 6 5.5 Miscellaneous Mathematical Symbols-A block .......................... 7 5.6 Miscellaneous Mathematical Symbols-B block .......................... 7 5.7 Miscellaneous Technical block .................................. 7 5.8 Geometric Shapes block ...................................... 8 5.9 Miscellaneous Symbols and Arrows block ............................. 9 5.10 Arrows block ........................................... 9 5.11 Supplemental Arrows-A block .................................. 10 5.12 Supplemental Arrows-B block ................................... 10 5.13 Combining Diacritical Marks for Symbols block ......................... 11 5.14
    [Show full text]
  • Unicodemath a Nearly Plain-Text Encoding of Mathematics Version 3.1 Murray Sargent III Microsoft Corporation 16-Nov-16
    Unicode Nearly Plain Text Encoding of Mathematics UnicodeMath A Nearly Plain-Text Encoding of Mathematics Version 3.1 Murray Sargent III Microsoft Corporation 16-Nov-16 1. Introduction ............................................................................................................ 2 2. Encoding Simple Math Expressions ...................................................................... 3 2.1 Fractions .......................................................................................................... 4 2.2 Subscripts and Superscripts........................................................................... 6 2.3 Use of the Blank (Space) Character ............................................................... 8 3. Encoding Other Math Expressions ........................................................................ 8 3.1 Delimiters ........................................................................................................ 8 3.2 Literal Operators ........................................................................................... 11 3.3 Prescripts and Above/Below Scripts ........................................................... 11 3.4 n-ary Operators ............................................................................................. 12 3.5 Mathematical Functions ............................................................................... 13 3.6 Square Roots and Radicals ........................................................................... 14 3.7 Enclosures ....................................................................................................
    [Show full text]
  • Glossary of Mathematical Symbols
    Glossary of mathematical symbols A mathematical symbol is a figure or a combination of figures that is used to represent a mathematical object, an action on mathematical objects, a relation between mathematical objects, or for structuring the other symbols that occur in a formula. As formulas are entirely constituted with symbols of various types, many symbols are needed for expressing all mathematics. The most basic symbols are the decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), and the letters of the Latin alphabet. The decimal digits are used for representing numbers through the Hindu–Arabic numeral system. Historically, upper-case letters were used for representing points in geometry, and lower-case letters were used for variables and constants. Letters are used for representing many other sort of mathematical objects. As the number of these sorts has dramatically increased in modern mathematics, the Greek alphabet and some Hebrew letters are also used. In mathematical formulas, the standard typeface is italic type for Latin letters and lower-case Greek letters, and upright type for upper case Greek letters. For having more symbols, other typefaces are also used, mainly boldface , script typeface (the lower-case script face is rarely used because of the possible confusion with the standard face), German fraktur , and blackboard bold (the other letters are rarely used in this face, or their use is unconventional). The use of Latin and Greek letters as symbols for denoting mathematical objects is not described in this article. For such uses, see Variable (mathematics) and List of mathematical constants. However, some symbols that are described here have the same shape as the letter from which they are derived, such as and .
    [Show full text]
  • A Modular Architecture for Unicode Text Compression
    A modular architecture for Unicode text compression Adam Gleave St John’s College A dissertation submitted to the University of Cambridge in partial fulfilment of the requirements for the degree of Master of Philosophy in Advanced Computer Science University of Cambridge Computer Laboratory William Gates Building 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom Email: [email protected] 22nd July 2016 Declaration I Adam Gleave of St John’s College, being a candidate for the M.Phil in Advanced Computer Science, hereby declare that this report and the work described in it are my own work, unaided except as may be specified below, and that the report does not contain material that has already been used to any substantial extent for a comparable purpose. Total word count: 11503 Signed: Date: This dissertation is copyright c 2016 Adam Gleave. All trademarks used in this dissertation are hereby acknowledged. Acknowledgements I would like to thank Dr. Christian Steinruecken for his invaluable advice and encouragement throughout the project. I am also grateful to Prof. Zoubin Ghahramani for his guidance and suggestions for extensions to this work. I would further like to thank Olivia Wiles and Shashwat Silas for their comments on drafts of this dissertation. I would also like to express my gratitude to Maria Lomelí García for a fruitful discussion on the relationship between several stochastic processes considered in this dissertation. Abstract Conventional compressors operate on single bytes. This works well on ASCII text, where each character is one byte. However, it fares poorly on UTF-8 texts, where characters can span multiple bytes.
    [Show full text]
  • Best Practice When Using Non-Alphabetic Characters in Orthographies: Helping Languages Succeed in the Modern World
    Best practice when using non-alphabetic characters in orthographies: Helping languages succeed in the modern world May 4, 2018 1. Why this document? 2. Word-forming characteristics of a symbol used for tone 3. Altering software programs 4. Linguistic and readability issues 5. Technical recommendations for selecting symbols 5.1. Word-forming characters 5.2. Non-word-forming characters 5.3. Adding characters to Unicode 6. Some specific uses of common symbols Hyphen Apostrophe Asterisk 7. Case Studies Case Study #1: Hyphens Case Study #2: Can you change the program for me? Case Study #3: A successful Unicode proposal, but... Case Study #4: Paratext remarks APPENDIX A: List of non-alphabetic symbols APPENDIX B: “Text Engineering” requirements REFERENCES (and useful Links) 1. Why this document? The goal of this document is to suggest best practices for using non-alphabetic symbols in Roman (Latin) script orthographies, for tone and other functions. We especially want to assist people dealing with marking grammatical tone, though others using non-alphabetical symbols such as apostrophes will find this helpful as well. “Best practices” here involves making orthographies as easy as possible for the user and enabling the orthography to be used as widely as possible on today's computers and mobile devices. Linguistic and sociopolitical factors are vital in orthography, but this paper focuses on the need to be compliant with Unicode so orthographies can be used on modern devices. An orthography that ignores such standards risks problems in text interchanges, and may end up being marginalized by the world at large. These are the Best Practices that are expounded throughout this paper.
    [Show full text]