XML and Semantic Web Technologies

XML and Semantic Web Technologies

II. XML / 1. , URIs, and XML Syntax

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL) Institute of Economics and Information Systems & Institute of Computer Science University of Hildesheim http://www.ismll.uni-hildesheim.de

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies

II. XML / 1. Unicode, URIs, and XML Syntax

1. Unicode

2. Uniform Resource Identifiers (URIs)

3. XML Syntax

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies / 1. Unicode Semantic Web Layer Cake

Figure 1: Semantic Web Layers (Berners-Lee).

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 1/42 XML and Semantic Web Technologies / 1. Unicode Coded Character Sets

name codes examples ASCII code 0–127 64 A 7→ ISO-8859-1, ISO-LATIN-1 0–255 0–127 as ASCII, 196 7→ ISO-8859-7 0–255 0–127 as ASCII, 225 α 7→ Unicode 0–(232 1) 0–255 as ISO-8859-1 −

Unicode is organized in 256 groups à 256 planes à 256 rows à 256 cells.

Plane 0 (codes 0–65535) is called basis multilingual (BMP).

Non ISO-8859-1 characters are mapped to higher codes, e.g., 945 α. 7→

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 2/42 XML and Semantic Web Technologies / 1. Unicode Unicode

Assigned characters of the Unicode standard (v6.0.0, 2011) can be found at http://www.unicode.org/charts/.

Unicode also specifies character classes for each charac- ter, as letters (capital and small), • digits, • , • control characters. •

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 3/42 XML and Semantic Web Technologies / 1. Unicode Unicode / Scripts

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 4/42 XML and Semantic Web Technologies / 1. Unicode Unicode / Symbols

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 5/42 XML and Semantic Web Technologies / 1. Unicode Character Encoding Schemata

coded character codes character byte characters character set (natural numbers) encoding schema sequences

Character Encoding Schemata are trivial for 1-byte coded character sets.

Direct representations of Unicode: UCS-2: direct representation of codes 0–65535 with 2 bytes. UCS-4: direct representation of all codes with 4 bytes.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 6/42 XML and Semantic Web Technologies / 1. Unicode

Drawbacks of direct representations: bytecode 0x00 occurs (that marks string endings in • C), e.g., in UCS-4: A 65 (0, 0, 0, 65) 7→ 7→

uniform blow-up of storage , but most texts • mostly use ASCII or ISO-8859-1.

error-prone, as if one byte is lost, all following data will • be decoded incorrectly.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 7/42 XML and Semantic Web Technologies / 1. Unicode Unicode Transformation Formats (UTF)

Unicode Transformation Formats (UTF) use a variable number of bytes for coding a character. UTF-8: 0x00-0x7f (bit sequences 0...... ) code ASCII characters directly, 0xc0-0xfd (bit sequences 11...... ) mark the start of a multi-byte character repre- sentation (and code its length and leading bits of its code), 0x80-0xbf (bit sequences 10...... ) code continuations of multi-byte character representations, 0xfe, 0xff (bit sequences 1111111.) are not used.

bit sequence bytes free bits character codes 0...... 1 7 0x00– 0x7f 110..... 10...... 2 5 + 6 = 11 0x80– 0x7ff 1110.... 10...... 10...... 3 4 + 2 6 = 16 0x800– 0xffff · 11110... 10...... 10...... 10...... 4 3 + 3 6 = 21 0x10000– 0x1fffff · 111110.. 10...... 10...... 10...... 10...... 5 2 + 4 6 = 26 0x200000– 0x3ffffff · 1111110. 10...... 10...... 10...... 10...... 10...... 6 1 + 5 6 = 31 0x4000000–0xffffffff ·

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 8/42 XML and Semantic Web Technologies

II. XML / 1. Unicode, URIs, and XML Syntax

1. Unicode

2. Uniform Resource Identifiers (URIs)

3. XML Syntax

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 9/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Uniform Resource Identifiers (URIs)

URIs are used to identify resources.

Example:

http://www.ismll.uni-hildesheim.de/lehre/xml-09s/index.html

URIs are defined in RFC 3986 (01/2005).

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 9/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs)

Generic URI syntax

/FE1!G/ IH

 "!$# % '&)(*",+-.!0/132*(4!05687*/ BD  D

¦=<>?<>

¡ ¡ ¤ ¤

>Q

¢¡£¡¥¤§¦ ¡

M7ON P .!G O# N R

L

AWV

SUT

9BAC:

@

:-

@KJ

9¢¨;:

¨ ©  

| {z } | {z } | {z } | {z } | {z }

:- 9BA-X:3V J*T

| {z }

¨ ©   ZY[¨ F©FX\?© A£:

@ @KJ

| {z }

¤¦¢¨ ! ¡¦"¢$#&%(' £¢*)!+ -,.¢/ ' 102%43 ,

¡£¢¥¤§¦©¨

 56 879 ;:< =?>6 @:BADCFE

 £  | {z } | {z }

Figure 6: Typical parts of URIs.

Generic URI syntax: URI := scheme : scheme-specific-part h i h i h i

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 10/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Hierarchical URIs

An URI is called hierarchical iff scheme-specific-part :=( // authority [ path ] h i h i h i | path )[ ? query ][ # fragment h i h i h i ] path :=( / path-segment )+ h i h i otherwise its called opaque.

The path-segments . and .. have special meaning: context path and parent path.

A hierarchical URI is called server-based iff authority :=[ userinfo @ ] host [ : port ] h i h i h i h i otherwise it is called registry-based.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 11/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Fragment identifiers

Fragment identifiers are used to identify parts of the resource identified by an URI.

Example: http://www.informatik.uni-freiburg.de/xml/books.html#R03

1

2

3

  • Rainer Eckstein, Silke Eckstein:

    4 XML und Datenmodellierung, 2004.

  • 5

  • Erik T. Ray:

    6 Learning XML, 2003.

  • 7

    8 Figure 7: HTML document at http://www.informatik.uni-freiburg.de/xml/books.html.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 12/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Relative (hierarchical) URIs A relative URI is defined as: relativeURI ::=( // authority [ path ] h i h i h i | path h i | relativePath )[ ? query ] h i h i relativePath := path-segment ( / path-segment )* h i h i h i .------. | .------. | | | .------. | | | | | .------. | | | | | | | .------. | | | | | | | | | | | | | | | | | | ‘------’ | | | | | | | | (5.1.1) Base URI embedded in the | | | | | | | | document’s content | | | | | | | ‘------’ | | | | | | (5.1.2) Base URI of the encapsulating entity | | | | | | (message, document, or none). | | | | | ‘------’ | | | | (5.1.3) URI used to retrieve the entity | | | ‘------’ | | (5.1.4) Default Base URI is application-dependent | ‘------’

    Figure 8: A Base URI is the context for resolving relative URIs [RFC 2396]. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 13/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) URI schemes URI schemes are managed by Internet Assigned Numbers Authority (IANA).

    Scheme Name Description Reference Type ftp File Transfer Protocol RFC 1738 server-based http Hypertext Transfer Protocol RFC 2616 server-based mailto Electronic mail address RFC 2368 opaque file Host-specific file names RFC 1738 server-based pop Post Office Protocol v3 RFC 2384 server-based dav dav RFC 2518 server-based tel telephone RFC 2806 opaque https Hypertext Transfer Protocol Secure RFC 2818 server-based urn Uniform Resource Names RFC 2141 opaque . . . .

    66 URI schemes (as of 2009-04-06; http://www.iana.org/assignments/uri-schemes.html). URI registrations are regulated by RFC 4396 (2/2006). Example: tel:+(49)-761-203-8164

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 14/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) URI types by URI semantics

    Uniform Resource Identifier (URI)

    Uniform Resource Uniform Resource Uniform Resource Locator (URL) Name (URN) Characteristics (URC)

    Figure 9: URI types.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 15/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Uniform Resource Names (URNs) URNs are special kinds of URIs that map other namespaces into URN-space, • are required to remain globally unique and persistent • (even when the resource ceases to exist or becomes unavailable). have scheme urn. •

    URN := urn: namespace : namespace-specific-part h i h i h i

    Examples: urn:isbn:0-395-36341-1 urn:newsml:reuters.com:20000206:IIMFFH05643_2000-02-06_17-54-01_L06156584:1U

    A book or a news item (identified by an URN) may be retrieved from different locations (URLs).

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 16/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs)

    URN Namespaces Value Reference ietf 1 RFC 2648 pin 2 RFC 3043 issn 3 RFC 3044 oid 4 RFC 3061 newsml 5 RFC 3085 oasis 6 RFC 3121 xmlorg 7 RFC 3120 publicid 8 RFC 3151 isbn 9 RFC 3187 nbn 10 RFC 3188 web3d 11 RFC 3541 mpeg 12 RFC 3614 mace 13 RFC 3613 fipa 14 RFC 3616 swift 15 RFC 3615 . . . 40 URN namespaces (as of 2008-12-09; http://www.iana.org/assignments/urn-namespaces) Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 17/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Characters Allowed in URIs

    In URIs only some characters may be used literally in non-syntactic parts ("data").

    All others have to be escaped using their code (in some character encoding):

    dataChars := alphanum | - | _ | . | ! | ~ | * | ’ | ( | ) h i h i escapedChar := % hexDigit hexDigit h i h i h i

    Codes have been interpreted as codes in different character encodings, depending on the URI scheme.

    UTF-8 is recommended by RFC 2718 and already used by some schemes (e.g., urn, imap, pop).

    Example: http://www.informatik.uni-freiburg.de/login.jsp?name=Hans%20Meyer

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 18/42 XML and Semantic Web Technologies / 2. Uniform Resource Identifiers (URIs) Internationalized Resource Identifiers (IRIs)

    IRIs allow more characters to be used literally (RFC 3987; 01/2005).

    In IRIs only data characters that can be misinterpreted as syntactic characters and • some bidirectional formatting characters • have to be escaped.

    All other data characters are used literally (in some character encoding, e.g., UTF-8).

    Example: http://www.informatik.uni-freiburg.de/login.jsp?name=Hans%20Müller

    Schemes still are restricted to US ASCII characters.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 19/42 XML and Semantic Web Technologies

    II. XML / 1. Unicode, URIs, and XML Syntax

    1. Unicode

    2. Uniform Resource Identifiers (URIs)

    3. XML Syntax

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 20/42 XML and Semantic Web Technologies / 3. XML Syntax W3C development process

    W3C specifications are called Recommendations.

    Stages of W3C recommendations: completion date stage XML 1.0 XML 1.1 Working Draft 1996/11/14 2001/12/13 1997/11/17 Last Call Working Draft 2002/04/25 Candidate Recommendation 2002/10/15 Proposed Recommendation 1997/12/08 2003/11/05 Recommendation 1998/02/10 2004/04/15 Working Draft 2000/08/14 Recommendation (2nd edition) 2000/10/06 2006/08/16 Proposed Edited Recommendation 2003/10/30 Recommendation (3rd edition) 2004/02/04 Recommendation (4th edition) 2006/08/16 Recommendation (5th edition) 2008/11/26

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 20/42 XML and Semantic Web Technologies / 3. XML Syntax

    Every XML document consists of a prolog and a single element, called root ele- ment.

    document := prolog element ( Comment | PI | S )* h i h i h i h i h i h i

    prolog := h i ( Comment | PI | S )* h i h i h i ( DoctypeDecl ( Comment | PI | S )* )? h i h i h i h i In all productions matching " can be replaced by ’. • = may be surrounded by spaces (i.e., match S ? = S ?). • h i h i

    S := (#x20 | #x9 | #xD | #xA)+ h i

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 21/42 XML and Semantic Web Technologies / 3. XML Syntax A minimal XML document

    1

    2 Figure 10: A minimal XML document with root element "page".

    In XML 1.1 the version attribute is mandatory.

    If the version attribute is missing, version 1.0 is assumed.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 22/42 XML and Semantic Web Technologies / 3. XML Syntax Elements and Attributes

    element := emptyElementTag h i h i | STag content ETag h i h i h i emptyElementTag := < Name ( S Name = " AttValue " )* S ? /> h i h i h i h i h i h i STag := < Name ( S Name = " AttValue " )* S ? > h i h i h i h i h i h i ETag := h i h i h i Name s h i start with a unicode letter or _ • (: is also allowed, but used for namespaces). may contain unicode letters, uncode digits, -, ., or . • · A wellformed document requires, that start and end tag of each element match, • that for each tag the same attribute never occurs twice. •

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 23/42 XML and Semantic Web Technologies / 3. XML Syntax Not-wellformed Documents (1/2)

    1

    2

    3 RainerEckstein

    4 SilkeEckstein

    5 XML und Datenmodellierung

    6 2004

    7

    8

    9 Erik T.Ray

    10 Learning XML

    11 2003

    12 Figure 11: More than one root element.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 24/42 XML and Semantic Web Technologies / 3. XML Syntax Not-wellformed Documents (2/2)

    1

    2

    3 Erik T.Ray

    4 Learning XML

    5 2003

    6 Figure 12: Miss-nested elements.

    1

    2

    3 XML und Datenmodellierung

    4 2004

    5 Figure 13: Same attribute occurs twice.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 25/42 XML and Semantic Web Technologies / 3. XML Syntax Element content

    The contents of an element can be made up from 6 different things: 1. other elements, 2. Character data, 3. References, 4. CDATA sections, 5. Processing instructions, and 6. comments.

    content := CharData ? h i h i (( element | Reference | CDSect | PI | Comment ) h i h i h i h i h i CharData ? )* h i

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 26/42 XML and Semantic Web Technologies / 3. XML Syntax Character data

    CharData may contain any characters except h i <, &, or the sequence >]]

    Attribute values may not contain ", if delimited by ", • ’, if delimited by ’, •

    These characters can be expressed by references.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 27/42 XML and Semantic Web Technologies / 3. XML Syntax Character data

    1

    2

    3 x^2 = y has no real solution for y < 0.

    4 But there are solutions for y = 0 & for y > 0.

    5 Figure 14: Forbidden characters in character data.

    1

    2

    3 x^2 = y has no real solution for y < 0.

    4 But there are solutions for y = 0 & for y > 0.

    5 Figure 15: Using references in character data.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 28/42 XML and Semantic Web Technologies / 3. XML Syntax References

    Reference := EntityRef | CharRef h i h i h i CharRef := &# [0-9]+ ; h i | &#x [0-9a-fA-F]+ ; EntityRef := & Name ; h i h i

    There are five predefined entity references: < > & ' " < > & ’ "

    All other entities known from HTML (as ä) are not predefined in XML.

    Custom entities can be defined in the document type declaration.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 29/42 XML and Semantic Web Technologies / 3. XML Syntax CDATA sections

    CDATA sections allow the literal usage of all characters (except the sequence ]]>).

    CDSect := h i h i

    CDATA sections are typically used for longer text containing < or &.

    CDATA sections are flat, i.e., there is no possibility to structure them with elements (as < or & are interpreted literally).

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 30/42 XML and Semantic Web Technologies / 3. XML Syntax Character data and CDATA sections

    1

    2

    3 x^2 = y has no real solution for y c; 0.

    4 But there are solutions for y = 0  for y e; 0.

    5 Figure 16: Using numeric character references.

    1

    2

    3 x^2 = y has no real solution for y < 0.

    4 But there are solutions for y = 0 & for y > 0.

    5 ]]> Figure 17: Using a CDATA-section.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 31/42 XML and Semantic Web Technologies / 3. XML Syntax Attribute values

    1

    2

    3 John Doe

    4 About wellformedness

    5 Figure 18: Literal usage of attribute delimiter.

    1

    2

    3 John Doe

    4 About wellformedness

    5 Figure 19: Using different attribute delimiters.

    1

    2

    3 John Doe

    4 About wellformedness

    5 Figure 20: Using references in attribute values. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 32/42 XML and Semantic Web Technologies / 3. XML Syntax Comments

    Comments can occur in the prolog and in the contents of elements.

    Comments are not allowed to contain the character sequence --.

    Comment := h i h i

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 33/42 XML and Semantic Web Technologies / 3. XML Syntax Comments

    1

    2

    3

    4

    5

    6 RainerEckstein

    7 SilkeEckstein

    8 XML und Datenmodellierung

    9

    10

    11 Figure 21: Comments in the prolog and in the contents of elements.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 34/42 XML and Semantic Web Technologies / 3. XML Syntax

    1

    2

    3 RainerEckstein

    4 SilkeEckstein

    5 XML und Datenmodellierung

    6 >2004

    7 Figure 22: Comments in tags are not allowed.

    1

    2

    3

    4

    5 RainerEckstein

    6 SilkeEckstein

    7 XML und Datenmodellierung

    8 2004

    9

    10 Figure 23: -- is not allowed in comments. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 35/42 XML and Semantic Web Technologies / 3. XML Syntax Processing Instructions

    Processing instructions (PIs) allow documents to contain instructions for applica- tions.

    PI := h i h i h i h i

    The name of a PI must be different from xml.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 36/42 XML and Semantic Web Technologies / 3. XML Syntax Character encoding schemata

    Character encoding schemata are specified by the name they are registered with at IANA (http://www.iana.org/assignments/character-sets), e.g., US-ASCII ISO-8859-1 ISO-10646-UCS-2 or csUnicode (UCS2) ISO-10646-UCS-4 or csUCS4 (UCS4) UTF-8 UTF-16 ...

    If no encoding is specified in the XML declaration, UTF-8 is assumed.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 37/42 XML and Semantic Web Technologies / 3. XML Syntax Character encoding schemata

    1

    2

    3 Grüß Gott !

    4 Figure 24: Non-wellformed document (assumed to be ISO-8859-1 coded).

    1

    2

    3 Grüß Gott !

    4 Figure 25: XML document coded in ISO-8859-1.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 38/42 XML and Semantic Web Technologies / 3. XML Syntax Language and Whitespaces

    There are two predefined attributes, xml:lang • and

    xml:space, • that can be used with any element. xml:lang specifies the language of the character contents of elements and at- tributes with (RFC 3066) an ISO language code • (http://www.loc.gov/standards/iso639-2/langcodes.html)

    or

    an IANA language code • (http://www.iana.org/assignments/language-tags).

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 39/42 XML and Semantic Web Technologies / 3. XML Syntax Language Attribute

    Example ISO and IANA language codes: language code meaning source de ISO German de-CH ISO German, Swiss variant de-DE ISO German, German variant en ISO English en-US ISO US English en-GB ISO Britain English tlh ISO Klingon de-1901 IANA German, traditional orthography de-1996 IANA German, orthography of 1996 . . .

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 40/42 XML and Semantic Web Technologies / 3. XML Syntax Language Attribute

    1

    2

    3

    Guten Morgen!

    4

    Good morning!

    5

    6

    7

    8

    USD01...
    EUR00.839818...

    9 Figure 26: Language attribute.

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 41/42 XML and Semantic Web Technologies / 3. XML Syntax References

    Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on XML and Semantic Web Technologies, summer term 2012 42/42