XML and Semantic Web Technologies II. XML / 1. Unicode, Uris, And

XML and Semantic Web Technologies XML and Semantic Web Technologies II. XML / 1. Unicode, URIs, and XML Syntax Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Economics and Information Systems & Institute of Computer Science University of Hildesheim http://www.ismll.uni-hildesheim.de Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies II. XML / 1. Unicode, URIs, and XML Syntax 1. Unicode 2. Uniform Resource Identifiers (URIs) 3. XML Syntax Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies / 1. Unicode Semantic Web Layer Cake Figure 1: Semantic Web Layers (Berners-Lee). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies / 1. Unicode Coded Character Sets name codes examples ASCII code 0–127 64 A 7→ ISO-8859-1, ISO-LATIN-1 0–255 0–127 as ASCII, 196 7→ ISO-8859-7 0–255 0–127 as ASCII, 225 α 7→ Unicode 0–(232 1) 0–255 as ISO-8859-1 − Unicode is organized in 256 groups à 256 planes à 256 rows à 256 cells. Plane 0 (codes 0–65535) is called basis multilingual plane (BMP). Non ISO-8859-1 characters are mapped to higher codes, e.g., 945 α. 7→ Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 2/42 XML and Semantic Web Technologies / 1. Unicode Unicode Assigned characters of the Unicode standard (v6.0.0, 2011) can be found at http://www.unicode.org/charts/. Unicode also specifies character classes for each character, as letters (capital and small), • digits, • punctuation, • control characters. • Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 3/42 XML and Semantic Web Technologies / 1. Unicode Unicode / Scripts Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 4/42 XML and Semantic Web Technologies / 1. Unicode Unicode / Symbols Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 5/42 XML and Semantic Web Technologies / 1. Unicode Character Encoding Schemata coded character codes character byte characters character set (natural numbers) encoding schema sequences Character Encoding Schemata are trivial for 1-byte coded character sets. Direct representations of Unicode: UCS-2: direct representation of codes 0–65535 with 2 bytes. UCS-4: direct representation of all codes with 4 bytes. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 6/42 XML and Semantic Web Technologies / 1. Unicode Drawbacks of direct representations: bytecode 0x00 occurs (that marks string endings in • C), e.g., in UCS-4: A 65 (0, 0, 0, 65) 7→ 7→ uniform blow-up of storage space, but most texts • mostly use ASCII or ISO-8859-1. error-prone, as if one byte is lost, all following data will • be decoded incorrectly. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 7/42 XML and Semantic Web Technologies / 1. Unicode Unicode Transformation Formats (UTF) Unicode Transformation Formats (UTF) use a variable number of bytes for coding a character. UTF-8: 0x00-0x7f (bit sequences 0........) code ASCII characters directly, 0xc0-0xfd (bit sequences 11.......) mark the start of a multi-byte character representation (and code its length and leading bits of its code), 0x80-0xbf (bit sequences 10.......) code continuations of multi-byte character representations, 0xfe, 0xff (bit sequences 1111111.) are not used. bit sequence bytes free bits character codes 0....... 1 7 0x00– 0x7f 110..... 10...... 2 5 + 6 = 11 0x80– 0x7ff 1110.... 10...... 10...... 3 4 + 2 6 = 16 0x800– 0xffff · 11110... 10...... 10...... 10...... 4 3 + 3 6 = 21 0x10000– 0x1fffff · 111110.. 10...... 10...... 10...... 10...... 5 2 + 4 6 = 26 0x200000– 0x3ffffff · 1111110. 10...... 10...... 10...... 10...... 10...... 6 1 + 5 6 = 31 0x4000000–0xffffffff · Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 8/42 XML and Semantic Web Technologies II. XML / 1. Unicode, URIs, and XML Syntax 1. Unicode 2. Uniform Resource Identifiers (URIs) 3. XML Syntax Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 9/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Uniform Resource Identifiers (URIs) URIs are used to identify resources. Example: http://www.ismll.uni-hildesheim.de/lehre/xml-09s/index.html URIs are defined in RFC 3986 (01/2005). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 9/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Generic URI syntax /FE1!G/ IH "!$# % '&)(*",+-.!0/132*(4!05687*/ BD D ¦=<>?<> ¡ ¡ ¤ ¤ >Q ¢¡£¡¥¤§¦ ¡ M7ON P .!GO# N R L AWV SUT 9BAC: @ :- @KJ 9¢¨;: ¨ © | {z } | {z } | {z } | {z } | {z } :- 9BA-X:3V J*T | {z } ¨ © ZY[¨ F©FX\?© A£: @ @KJ | {z } ¤¦¢¨ ! ¡¦"¢$#&%(' £¢*)!+-,.¢/ ' 102%43 , ¡£¢¥¤§¦©¨ 56 879 ;:< =?>6 @:BADCFE £ | {z } | {z } Figure 6: Typical parts of URIs. Generic URI syntax: URI := scheme : scheme-specific-part h i h i h i Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 10/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Hierarchical URIs An URI is called hierarchical iff scheme-specific-part :=( // authority [ path ] h i h i h i | path )[ ? query ][ # fragment h i h i h i ] path :=( / path-segment )+ h i h i otherwise its called opaque. The path-segments . and .. have special meaning: context path and parent path. A hierarchical URI is called server-based iff authority :=[ userinfo @ ] host [ : port ] h i h i h i h i otherwise it is called registry-based. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 11/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Fragment identifiers Fragment identifiers are used to identify parts of the resource identified by an URI. Example: http://www.informatik.uni-freiburg.de/xml/books.html#R03 1 <html> 2 <body> 3 <li><a name="EE04">Rainer Eckstein, Silke Eckstein: 4 <em>XML und Datenmodellierung</em>, 2004.</a></li> 5 <li><a name="R03">Erik T. Ray: 6 <em>Learning XML</em>, 2003.</a></li> 7 </body> 8 </html> Figure 7: HTML document at http://www.informatik.uni-freiburg.de/xml/books.html. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 12/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Relative (hierarchical) URIs A relative URI is defined as: relativeURI ::=( // authority [ path ] h i h i h i | path h i | relativePath )[ ? query ] h i h i relativePath := path-segment ( / path-segment )* h i h i h i .----------------------------------------------------------. | .----------------------------------------------------. | | | .----------------------------------------------. | | | | | .----------------------------------------. | | | | | | | .----------------------------------. | | | | | | | | | <relative_reference> | | | | | | | | | ‘----------------------------------’ | | | | | | | | (5.1.1) Base URI embedded in the | | | | | | | | document’s content | | | | | | | ‘----------------------------------------’ | | | | | | (5.1.2) Base URI of the encapsulating entity | | | | | | (message, document, or none). | | | | | ‘----------------------------------------------’ | | | | (5.1.3) URI used to retrieve the entity | | | ‘----------------------------------------------------’ | | (5.1.4) Default Base URI is application-dependent | ‘----------------------------------------------------------’ Figure 8: A Base URI is the context for resolving relative URIs [RFC 2396]. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 13/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) URI schemes URI schemes are managed by Internet Assigned Numbers Authority (IANA). Scheme Name Description Reference Type ftp File Transfer Protocol RFC 1738 server-based http Hypertext Transfer Protocol RFC 2616 server-based mailto Electronic mail address RFC 2368 opaque

XML and Semantic Web Technologies II. XML / 1. Unicode, Uris, And

Unicode Nearly Plain-Text Encoding of Mathematics Murray Sargent III Office Authoring Services, Microsoft Corporation 4-Apr-06

Using Lexical Tools to Convert Unicode Characters to ASCII

L2/20-136 Reference: Action: for Consideration by JTC1/SC2 Date: Apri 29, 2020

UTR #25: Unicode and Mathematics

Proposal for Addition of Group Mark Symbol Introduction

Unicode Nearly Plain Text Encoding of Mathematics

A Proposed Set of Mnemonic Symbolic Glyphs for the Visual Representation of C0 Controls and Other Nonprintable ASCII Characters Michael P

Unicode Characters in Proofpower Through Lualatex

Unicodemath a Nearly Plain-Text Encoding of Mathematics Version 3.1 Murray Sargent III Microsoft Corporation 16-Nov-16

Glossary of Mathematical Symbols

A Modular Architecture for Unicode Text Compression

Best Practice When Using Non-Alphabetic Characters in Orthographies: Helping Languages Succeed in the Modern World